General retry and reconnect guidelines for applications and devices
Your devices and applications interacting with the Bosch IoT Suite need to be able to bridge small interruptions or latencies.
On the one hand, this requirement is imposed by the fact that we work with distributed systems. On the other hand, your devices and applications also depend heavily on the network services of other providers and should not break if they cannot keep a connection to our infrastructure, at least for a short time.
To help you identify cases where your request might not have reached our service or your physical devices, the sections below list recommendations for each service layer in use.
Recommendations by the device connectivity layer
Reconnect to an endpoint
The following requirements apply to an IoT device:
It must be able to detect that a connection or the context of sending a message (for connection-less protocols) is dropped. See Client connection drop detection.
It must have the ability to reconnect (including the initial connection approach).
It must maintain a meaningful behavior for reconnect. See Exponential back off for retrying device actions.
For IoT devices, a minimum of three connection attempts within at least three minutes must be treated as reconnection time and not as downtime.
Retry sending a message
The device interacts with the device connectivity layer using the following messaging patterns, each of which provides different delivery guarantees to the device.
Telemetry messages
QoS 0
The device connectivity layer guarantees high throughput, but any message might get lost. The device receives a response directly from the protocol adapter and does not wait for confirmation from other components that the message was processed successfully. To handle these circumstances, IoT devices must consider three cases:
Successful response from protocol adapter - indicates that the protocol adapter tried to forward the message.
The message might, however, still get lost later in the delivery process. The application must take this into account when interpreting the status of an IoT device.
Error response from protocol adapter - indicates that the message was received by the protocol adapter but could not be forwarded due to an internal system error, because parts of the system were temporarily unavailable, or because a message limit was exceeded for the tenant.
In this case, the IoT device should, in the spirit of QoS 0, not perform a retry. With QoS 0 we generally assume a very high volume of messages, so a retry policy may cause message congestion.
No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
See Reconnect to an endpoint.
QoS 1
The device connectivity layer guarantees high throughput, and the message is persisted to short-term storage in different availability zones to lower the risk of data loss.
The protocol adapter waits for confirmation that the message has been persisted to short-term storage and only then responds to the device, which leads to a longer response time than for telemetry QoS 0 messages.
Successful response from protocol adapter - indicates that the message was persisted within the device connectivity layer and is available to be fetched by the digital twin layer.
This does not, however, guarantee that the message has actually been processed by the digital twin layer service or by the receiving application, which the IoT device must take into account.
Error response from protocol adapter - indicates that the message was received by the protocol adapter but could not be persisted due to an internal system error, because a message limit was exceeded for the tenant, or, more specifically, because the message queue capacity was exceeded.
In the error response case, the IoT device should perform a retry with a back off mechanism, depending on the received error signal. See Reaction of received error status.
No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
See Reconnect to an endpoint.
Event messages
The device connectivity layer implements different technical means to ensure, as far as possible, the delivery of event messages to their consumers. For example, an event message is persisted in different availability zones to lower the risk of data loss. This allows event messages to be delivered to their consumers even in edge cases such as application crashes. The trade-off is that throughput might be lower than for QoS 1 telemetry.
Successful response from protocol adapter - indicates that the message was persisted within the device connectivity layer and is available to be fetched by the digital twin layer.
This does not, however, guarantee that the message is actually processed by the receiving application, which the IoT device must take into account. (For more information on the actual processing acknowledgments of a message in the IoT Suite context, see Acknowledgements.)
Error response from protocol adapter - indicates that the message was received by the protocol adapter but could not be persisted due to an internal system error, because a message limit was exceeded for the tenant, or, more specifically, because the message queue capacity was exceeded.
In the error response case, the IoT device should perform a retry with a back off mechanism, depending on the received error signal. See Reaction of received error status.
No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
See Reconnect to an endpoint.
Command response messages
The device connectivity layer guarantees medium throughput, and the message is persisted to short-term storage in different availability zones to lower the risk of data loss. The protocol adapter waits for confirmation that the command response has been persisted to short-term storage and only then responds to the device.
Successful response from protocol adapter - indicates that the command response was persisted within the device connectivity layer and is available to be fetched by the digital twin layer.
This does not, however, guarantee that the command response has actually been processed by the receiving application, which the IoT device must take into account.
Error response from protocol adapter - indicates that the message was received by the protocol adapter but could not be persisted due to an internal system error, because a message limit was exceeded for the tenant, or, more specifically, because the message queue capacity was exceeded.
In the error response case, the IoT device should perform a retry with a back off mechanism, depending on the received error signal. See Reaction of received error status.
No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
See Reconnect to an endpoint.
If the repeated sending of a device message is not possible for a period longer than one minute, then this time is considered as downtime.
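The per-pattern cases above can be condensed into a small decision function. The following is only a sketch: the enum and function names are hypothetical and not part of any Bosch IoT Suite client library, and how an outcome is detected depends on the protocol in use.

```python
from enum import Enum, auto


class MessageKind(Enum):
    TELEMETRY_QOS0 = auto()
    TELEMETRY_QOS1 = auto()
    EVENT = auto()
    COMMAND_RESPONSE = auto()


class Outcome(Enum):
    SUCCESS = auto()  # successful response from the protocol adapter
    ERROR = auto()    # error response from the protocol adapter
    TIMEOUT = auto()  # no response from the protocol adapter


class Action(Enum):
    DONE = auto()                # nothing more to do on the device side
    DROP = auto()                # give up on this message, do not retry
    RETRY_WITH_BACKOFF = auto()  # resend after a back off pause
    RECONNECT = auto()           # connection may be interrupted


def next_action(kind: MessageKind, outcome: Outcome) -> Action:
    """Map a send outcome to the device-side reaction described above."""
    if outcome is Outcome.SUCCESS:
        # Success only confirms forwarding or persistence, not processing
        # by the receiving application.
        return Action.DONE
    if outcome is Outcome.TIMEOUT:
        return Action.RECONNECT
    if kind is MessageKind.TELEMETRY_QOS0:
        # QoS 0: retries could cause message congestion.
        return Action.DROP
    return Action.RETRY_WITH_BACKOFF
```

For example, next_action(MessageKind.EVENT, Outcome.ERROR) yields Action.RETRY_WITH_BACKOFF, while the same error for QoS 0 telemetry yields Action.DROP.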
Retry subscribing for commands
To receive commands, an IoT device needs to subscribe for them; how this is done is protocol specific.
If the subscription fails (e.g. timeout or error response due to internal system error) the IoT device must implement a retry behavior with a back off mechanism depending on the received error signal. See Reaction of received error status.
If the subscription of a device for command messages is not possible for a period longer than one minute, then this time is considered as downtime.
Reaction of received error status
In case of errors, the device connectivity layer generally returns an error code indicating the problem that occurred.
The number and granularity of available error codes depend on the protocol in use.
Some protocols allow more verbose signaling of the problem (e.g. AMQP 1.0), while others are simpler and therefore cannot convey as much information in case of errors (e.g. MQTT v3).
The device connectivity layer error codes can be assigned to one of the following groups:
Client errors
An error code indicating that something is wrong with the request itself.
Example: The maximum allowed message length is exceeded.
A retry is in general not recommended, as a retry with an unchanged message will not lead to success.
Well-known server errors
The device connectivity layer returns a well-known, defined error code indicating a specific problem.
Example: The message limit for the current invoice period of the tenant is reached.
The client should implement proper handling of all available well-known error codes.
The list of available error codes is provided in the documentation of the respective protocol adapter.
Depending on the error situation, this handling can include (but is not limited to):
Retry after fixed amount of time,
Retry with exponential back off (see Exponential back off for retrying device actions),
Reconnect,
Drop of the message,
Triggering of further actions.
Generic server errors
The device connectivity layer returns a generic error code indicating a cloud problem.
Example: HTTP code 500 is returned to a client because a transient internal error occurred during processing.
The devices can retry the operation using an exponential back off (see Exponential back off for retrying device actions).
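One way to keep the handling of the three error groups in a single place is a small dispatch function. This is a sketch under stated assumptions: the names ErrorGroup, Handling and choose_handling are hypothetical, the table of well-known codes has to be filled from the documentation of the respective protocol adapter, and falling back to exponential back off for a well-known code without a table entry is a choice of this sketch rather than a documented rule.

```python
from enum import Enum
from typing import Mapping, Optional


class ErrorGroup(Enum):
    CLIENT = "client error"
    WELL_KNOWN_SERVER = "well-known server error"
    GENERIC_SERVER = "generic server error"


class Handling(Enum):
    DO_NOT_RETRY = "do not retry the unchanged message"
    RETRY_FIXED_DELAY = "retry after a fixed amount of time"
    RETRY_EXPONENTIAL_BACKOFF = "retry with exponential back off"
    RECONNECT = "reconnect"
    DROP_MESSAGE = "drop the message"


def choose_handling(
    group: ErrorGroup,
    code: Optional[str] = None,
    well_known: Optional[Mapping[str, Handling]] = None,
) -> Handling:
    """Pick a reaction for an error, based on its group and optional code."""
    if group is ErrorGroup.CLIENT:
        # Resending the same request will fail again.
        return Handling.DO_NOT_RETRY
    if group is ErrorGroup.WELL_KNOWN_SERVER and well_known and code in well_known:
        # Per-code handling supplied by the caller.
        return well_known[code]
    # Generic server errors, and (assumption of this sketch) well-known
    # codes without an explicit entry.
    return Handling.RETRY_EXPONENTIAL_BACKOFF
```

A caller might pass a table such as {"LIMIT_REACHED": Handling.RETRY_FIXED_DELAY}, where "LIMIT_REACHED" is a made-up code name used only for illustration.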
Exponential back off for retrying device actions
Devices and clients connected to the device connectivity layer should implement means to not produce unnecessary load on the cloud components in case of transient failures.
A transient failure could be, for example, a temporary unavailability of the service or the loss of a connection.
In such cases, highly frequent retries are counter-productive and should be avoided.
The most common approach to handling retries in such cases is to implement an exponential back off. Various client libraries implement such a back off; in the Java ecosystem, for example, the following is common: https://resilience4j.readme.io/docs/retry.
In general the device connectivity layer recommends the following retry behavior:
Type: Exponential back off
Initial retry: 200 milliseconds (The pause to wait at minimum after a previously successful action failed.)
Minimum retry value: 200 milliseconds (The pause to wait at least in between retry attempts.)
Maximum retry value: 300 seconds (The maximum pause to wait between retry attempts.)
Depending on the protocol in use (HTTP, MQTT, CoAP, etc.), the corresponding protocol specification might contain further recommendations, which should be considered as well.
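The recommended values can be turned into a delay calculation along the following lines. This is a minimal sketch: the function name is hypothetical, and the growth factor of 2 is an assumption, since the recommendation above only fixes the initial, minimum and maximum pause.

```python
def backoff_delay(attempt: int, initial: float = 0.2, maximum: float = 300.0) -> float:
    """Return the pause in seconds before retry number `attempt` (0-based).

    Defaults follow the recommendation above: 200 ms initial/minimum pause
    and 300 s maximum pause. Doubling per attempt is an assumption.
    """
    if attempt < 0:
        raise ValueError("attempt must not be negative")
    # Cap the exponent so very large attempt numbers cannot overflow a float.
    exponent = min(attempt, 60)
    return min(maximum, initial * (2 ** exponent))
```

With these defaults the pauses start at 0.2 s, 0.4 s, 0.8 s and so on, and stay at 300 s from the twelfth retry onwards.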
Client connection drop detection
The device connectivity layer lets clients connect via various protocols, and each of those protocols has its own means of establishing a connection.
Some protocols might use a new connection for each request (e.g. HTTP, without connection-reuse), while others can reuse a connection for multiple requests (e.g. MQTT).
With protocols that allow reusing an existing connection, devices must be able to handle connection drops.
The specification of the protocol in use generally describes how to detect and handle such cases.
Most commonly a keep alive mechanism is used on the protocol layer.
The protocol adapter specific implementation / configuration notes are documented on the respective protocol adapter documentation page. Refer to MQTT adapter - Inactivity timeout as an example.
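On the client side, a keep alive check can be approximated with a simple watchdog that tracks the time of the last traffic seen on the connection. The sketch below is not taken from any protocol adapter; the class name is hypothetical, and the grace factor of 1.5 is an assumption that mirrors the tolerance MQTT servers apply to the keep alive interval. In practice, the keep alive handling of the protocol library in use should be preferred.

```python
import time
from typing import Callable


class ConnectionWatchdog:
    """Flags a connection as dropped when no traffic was seen for too long."""

    def __init__(
        self,
        keep_alive_seconds: float,
        grace_factor: float = 1.5,  # assumption, see lead-in
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = keep_alive_seconds * grace_factor
        self._clock = clock
        self._last_seen = clock()

    def record_activity(self) -> None:
        """Call whenever a packet (data or keep alive reply) arrives."""
        self._last_seen = self._clock()

    def is_dropped(self) -> bool:
        """True if the silence exceeds keep alive interval times grace factor."""
        return self._clock() - self._last_seen > self._limit
```

A device loop would call record_activity() for every inbound packet and start its reconnect behavior once is_dropped() returns True.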
Recommendations by the digital twin layer
Your devices and applications interacting with the digital twin layer need to be able to bridge small interruptions or latencies. On the one hand, this requirement is imposed by the fact that we work with distributed systems. On the other hand, your devices and applications also depend heavily on the network services of other providers and should not break if they cannot keep a connection to our infrastructure, at least for a short time.
To identify cases where your request might not have reached our service, we recommend relying on the exception you receive in response to a request.
The status code returned by the digital twin layer service can help indicate whether the failure is transient. You might need to examine the exceptions generated by the client in use. For example, the Ditto client provides exceptions in HTTP status code semantics, so that you can analyze and interpret the exception.
The following HTTP status codes typically indicate that a retry is appropriate:
408 Request Timeout
424 Failed Dependency
429 Too Many Requests
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
No matter whether you access our service via the digital twin layer HTTP API, WebSockets, or the Ditto protocol, we provide the same status codes in case of exceptions.
Every use of our service through its APIs should apply a systematic approach to managing retries, including an exponential back off, as well as reconnects.
Ideally, you should write reusable code, so that you can apply a consistent methodology across all clients and all applications.
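As an illustration of such reusable code, the retryable status codes listed above can be kept in one shared helper that every client calls before deciding to retry. The names below are hypothetical and not part of the Ditto client or any other library.

```python
# Status codes listed above as typically indicating that a retry is appropriate.
RETRYABLE_STATUS_CODES = frozenset({408, 424, 429, 500, 502, 503, 504})


def is_retryable(status_code: int) -> bool:
    """Return True if the status code suggests a transient failure."""
    return status_code in RETRYABLE_STATUS_CODES
```

A client would check is_retryable(status) on each failed request and only then schedule a retry with an exponential back off; any other status is reported to the caller without retrying.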
Recommendations by the mass device management layer
Your applications interacting with Bosch IoT Manager should be able to bridge small interruptions or latencies, for example during near-zero downtime deployments or load restrictions. On the one hand, this requirement is imposed by the fact that we work with distributed systems, and in a distributed system an endpoint might become unavailable for a short time due to networking issues, redeployments, restarts, reassignment of resources etc. On the other hand, your applications also depend heavily on the network services of other providers and should not break if they cannot keep a connection to our infrastructure, at least for a short time.
To identify cases where your request might not have reached our service, we recommend relying on the exception you receive in response to a request.
The status code returned by the Bosch IoT Manager service might help to indicate whether the failure is transient or not. You might need to examine the exceptions generated by the client in use.
REST API responses
When using the REST API of Bosch IoT Manager, you may encounter the responses listed below. Consider retry accordingly.
Response: Description
404 Resource not found: This status code generally indicates that there is no such resource. However, a retry could be useful, because the resource might exist while the system is currently rebooting and the REST API is temporarily unavailable.
408 Request timeout: As this status code indicates a request timeout, a retry may be useful.
429 Too many requests: This status code indicates that a limit has been hit. A retry may therefore be useful if some place in the queue has been vacated in the meantime. Check the exponential back off section below for the recommended intervals.
500 Internal Server Error: In some cases, though not always, this status code indicates service downtime. Therefore a retry might be helpful.
502 Bad gateway: This status code indicates that the server, while acting as a gateway or proxy, received an invalid response from the upstream server. Therefore a retry might be helpful.
503 Service unavailable: This status code indicates that the service is temporarily unavailable. Therefore a retry might be helpful.
504 Gateway timeout: This status code indicates that the server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request. Therefore a retry might be helpful.
Java API responses
When using the Java API of Bosch IoT Manager (via gRPC) you should build your retry/reconnect logic around the following responses, which can be generally split into two sections.
Reconnect
It may be useful to perform an auto-reconnect if you receive the following gRPC response:
Response: Description
UNAVAILABLE: The service is currently unavailable. This is most likely a transient condition and may be corrected by retrying with a backoff. Note that it is not always safe to retry non-idempotent operations.
Retry
It may be useful to retry sending the same request if you receive any of the following Manager responses:
Response: Description
INTERNAL_ERROR: The server encountered an unexpected condition that prevented it from fulfilling the request.
TIMEOUT: The server did not receive a complete request message within the time that it was prepared to wait.
TOO_MANY_REQUESTS: The user has sent too many requests in a given amount of time ("rate limiting").
TEMPORARY_UNAVAILABLE: The server is temporarily unavailable.
It may be useful to make some changes to your request and then retry it, if you receive the following response:
Response: Description
EXPIRED: The authentication credentials for the target resource have expired. You can retry, but you should first reconnect with an updated authentication token.
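The reconnect and retry responses above can be gathered into one lookup that the retry/reconnect logic consults. The sketch below uses plain strings as stand-ins for the response names; it does not use the actual types of the Bosch IoT Manager Java API, and treating any unlisted response as not retryable is an assumption of this sketch.

```python
from enum import Enum


class Reaction(Enum):
    RECONNECT = "reconnect, then retry with a back off"
    RETRY = "retry the same request"
    REFRESH_TOKEN_AND_RETRY = "reconnect with an updated token, then retry"
    FAIL = "do not retry"


# Response names as listed above, mapped to the suggested reaction.
REACTIONS = {
    "UNAVAILABLE": Reaction.RECONNECT,
    "INTERNAL_ERROR": Reaction.RETRY,
    "TIMEOUT": Reaction.RETRY,
    "TOO_MANY_REQUESTS": Reaction.RETRY,
    "TEMPORARY_UNAVAILABLE": Reaction.RETRY,
    "EXPIRED": Reaction.REFRESH_TOKEN_AND_RETRY,
}


def reaction_for(response_name: str) -> Reaction:
    """Look up the reaction; unlisted responses are not retried (assumption)."""
    return REACTIONS.get(response_name, Reaction.FAIL)
```

Keeping the mapping in data rather than in scattered conditionals makes it easier to apply the same policy in every application that uses the API.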
Exponential back off for retrying calls to the Bosch IoT Manager APIs
Applications connected to Bosch IoT Manager should implement means to not produce unnecessary load on the cloud components in case of transient failures. A transient failure could be e.g. a temporary unavailability of the service or loss of connection. In such cases highly frequent retries are counter-productive and should be avoided.
The most common approach to handling retries in such cases is to implement an exponential back off. Various client libraries implement such a back off; in the Java ecosystem, for example, the following is common: https://resilience4j.readme.io/docs/retry.
Consider the following recommended intervals:
Initial retry: 500 milliseconds (The pause to wait at minimum after a previously successful action failed.)
Minimum retry value: 500 milliseconds (The pause to wait at least in between retry attempts, which can be doubled on each attempt.)
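A retry wrapper using these intervals could look like the following sketch. TransientError and retry_with_backoff are hypothetical names, and the upper bound of 60 seconds and the limit of 5 attempts are assumptions, since no maximum is stated above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    initial: float = 0.5,     # 500 ms initial and minimum pause, as recommended
    max_delay: float = 60.0,  # assumption: no maximum is given above
    max_attempts: int = 5,    # assumption
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the pause after each transient failure."""
    delay = initial
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay = min(max_delay, delay * 2)
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so that the pause schedule can be checked in tests without actually waiting.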
Recommendations by the software update layer
Your devices and applications interacting with Bosch IoT Rollouts need to be able to bridge small interruptions or latencies. On the one hand, this requirement is imposed by the fact that we work with distributed systems, where short interruptions can result from a networking issue, a maintenance activity, or a service incident. On the other hand, your devices and applications also depend heavily on the network services of other providers and should not break if they cannot keep a connection to our infrastructure, at least for a short time.
To identify cases where your request might not have reached our service, we recommend relying on the exception you receive in response to a request. The status code returned by the Bosch IoT Rollouts service can help indicate whether the failure is transient.
The following HTTP status codes typically indicate that a retry is appropriate:
409 Conflict
429 Too Many Requests
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
No matter whether you request our service via the Bosch IoT Rollouts Management API or DDI API, in case of exceptions we provide the same status codes.
Every use of our APIs should apply a systematic approach to managing retries, including an exponential back off, as well as reconnects. This helps to avoid unnecessary load that can become counterproductive.