General retry and reconnect guidelines for applications and devices

Your devices and applications interacting with the Bosch IoT Suite need to be able to bridge small interruptions or latencies.

On the one hand, this requirement is imposed by the fact that we work with distributed systems. On the other hand, your devices and applications also depend heavily on the network services of other providers and should not break if they cannot keep a connection to our infrastructure, at least for a short time.

To help you identify cases where your request might not have reached our service or your physical devices, the following sections provide recommendations for each service layer in use.

Recommendations by the device connectivity layer
Reconnect to an endpoint

The following requirements apply to an IoT device:
- It must be able to detect that a connection or the context of sending a message (for connection-less protocols) is dropped. See Client connection drop detection.
- It must have the ability to reconnect (including the initial connection approach).
- It must maintain a meaningful behavior for reconnect. See Exponential back off for retrying device actions.
For IoT devices a minimum of three connection attempts within at least three minutes needs to be considered as re-connection time and not as downtime.

Retry sending a message

The device interacts with the device connectivity layer using the following messaging patterns, each of which provides different delivery guarantees to the device.
- Telemetry messages
  - QoS 0
    The device connectivity layer guarantees high throughput, but any message might get lost. The device receives a response directly from the protocol adapter and does not wait for confirmation from other components that the message was processed successfully. To handle these circumstances, IoT devices must consider three cases:
    
    Successful response from protocol adapter - indicates that the protocol adapter tried to forward the message.
    However, the message might still get lost later in the delivery process. The application must therefore take this into account when interpreting the status of an IoT device.
    
    Error response from protocol adapter - indicates that the message was received by the protocol adapter, but could not be forwarded due to an internal system error, because parts of the system were temporarily unavailable, or because a message limit was exceeded for the tenant.
    In this case, in keeping with QoS 0, the IoT device should not perform a retry. Because QoS 0 is generally used to send a very high number of messages, a retry policy may cause message congestion.
    
    No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
    See Reconnect to an endpoint.
  - QoS 1
    The device connectivity layer guarantees high throughput, and the message is persisted to short-term storage in different availability zones to lower the risk of data loss.
    The protocol adapter waits for confirmation that the message has been persisted to short-term storage before it responds to the device, which leads to a longer response time than for telemetry QoS 0 messages.
    
    Successful response from protocol adapter - indicates that the message was persisted within the device connectivity layer and is available to be fetched by the digital twin layer.
    However, this does not guarantee that the message has actually been processed by the digital twin layer service or by the receiving application, which the IoT device must take into account.
    
    Error response from protocol adapter - indicates that the message was received by the protocol adapter, but could not be persisted due to an internal system error, because a message limit was exceeded for the tenant, or, more specifically, because the message queue limit was exceeded.
    In the error response case, the IoT device should perform a retry with a back off mechanism, depending on the received error signal. See Reaction of received error status.
    
    No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
    See Reconnect to an endpoint.
- Event messages
  The device connectivity layer implements several technical measures to ensure, as far as possible, that event messages are delivered to their consumers. For example, an event message is persisted in different availability zones to lower the risk of data loss. This allows event messages to be delivered to their consumers even in edge cases such as application crashes. The trade-off is that throughput might be lower than with QoS 1 telemetry.
  - Successful response from protocol adapter - indicates that the message was persisted within the device connectivity layer and is available to be fetched by the digital twin layer.
    However, this does not guarantee that the message is actually processed by the receiving application, which the IoT device must take into account. (For more information about acknowledgements of the actual processing of a message in the IoT Suite context, see Acknowledgements.)
  - Error response from protocol adapter - indicates that the message was received by the protocol adapter, but could not be persisted due to an internal system error, because a message limit was exceeded for the tenant, or, more specifically, because the message queue limit was exceeded.
    In the error response case, the IoT device should perform a retry with a back off mechanism, depending on the received error signal. See Reaction of received error status.
  - No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
    See Reconnect to an endpoint.
- Command response messages
  The device connectivity layer guarantees medium throughput, and the message is persisted to short-term storage in different availability zones to lower the risk of data loss. The protocol adapter waits for confirmation that the command response has been persisted to short-term storage before it responds to the device.
  - Successful response from protocol adapter - indicates that the command response was persisted within the device connectivity layer and is available to be fetched by the digital twin layer.
    However, this does not guarantee that the command response has actually been processed by the receiving application, which the IoT device must take into account.
  - Error response from protocol adapter - indicates that the message was received by the protocol adapter, but could not be persisted due to an internal system error, because a message limit was exceeded for the tenant, or, more specifically, because the message queue limit was exceeded.
    In the error response case, the IoT device should perform a retry with a back off mechanism, depending on the received error signal. See Reaction of received error status.
  - No response (timeout) from protocol adapter - indicates that the connection from the IoT device to the protocol adapter might be interrupted.
    See Reconnect to an endpoint.
If the repeated sending of a device message is not possible for a period longer than one minute, then this time is considered as downtime.
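
The reactions described above can be condensed into a small lookup. The following Python sketch is illustrative only: the enum names and the `next_action` function are hypothetical and not part of any Bosch IoT Suite SDK, and for error responses the concrete reaction still depends on the received error signal.

```python
from enum import Enum


class MessageType(Enum):
    TELEMETRY_QOS0 = "telemetry-qos0"
    TELEMETRY_QOS1 = "telemetry-qos1"
    EVENT = "event"
    COMMAND_RESPONSE = "command-response"


class Outcome(Enum):
    SUCCESS = "success"  # successful response from the protocol adapter
    ERROR = "error"      # error response from the protocol adapter
    TIMEOUT = "timeout"  # no response from the protocol adapter


class Action(Enum):
    DONE = "done"
    DROP = "drop"                          # do not retry
    RETRY_WITH_BACKOFF = "retry-backoff"   # resend, pausing between attempts
    RECONNECT = "reconnect"                # re-establish the connection first


def next_action(message_type: MessageType, outcome: Outcome) -> Action:
    """Map the outcome of a send attempt to the reaction described in the text."""
    if outcome is Outcome.SUCCESS:
        return Action.DONE
    if outcome is Outcome.TIMEOUT:
        # The connection might be interrupted: see "Reconnect to an endpoint".
        return Action.RECONNECT
    # Error response: QoS 0 telemetry is not retried to avoid congestion.
    if message_type is MessageType.TELEMETRY_QOS0:
        return Action.DROP
    # All other patterns retry with a back off mechanism.
    return Action.RETRY_WITH_BACKOFF
```

For example, `next_action(MessageType.EVENT, Outcome.ERROR)` yields `Action.RETRY_WITH_BACKOFF`, whereas the same error for QoS 0 telemetry yields `Action.DROP`.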

Retry subscribing for commands

To receive commands, an IoT device needs to subscribe. How this is done is protocol-specific.

If the subscription fails (e.g. timeout or error response due to internal system error) the IoT device must implement a retry behavior with a back off mechanism depending on the received error signal. See Reaction of received error status.

If the subscription of a device for command messages is not possible for a period longer than one minute, then this time is considered as downtime.

Reaction of received error status

In case of an error, the device connectivity layer generally returns an error code indicating the problem that occurred.

The number and granularity of available error codes depend on the protocol used.

Some protocols allow more verbose signaling of the problem (e.g. AMQP 1.0), while others are simpler and therefore cannot convey as much information in case of errors (e.g. MQTT v3).

The device connectivity layer error codes can be assigned to one of the following groups:
- Client errors
  - An error code indicating that something is wrong with the request itself.
    Example: The maximum allowed message length is exceeded.
  - A retry is in general not recommended, as a retry with an unchanged message will not lead to success.
- Well-known server errors
  - The device connectivity layer returns a well-known / defined error code indicating a specific problem.
    Example: The message limit for the current invoice period of the tenant is reached.
  - The client should implement proper handling of all available well-known error codes.
    The list of available error codes is provided in the documentation of the respective protocol adapter.
  - Depending on the error situation, this handling can include (but is not limited to):
    - Retrying after a fixed amount of time
    - Retrying with exponential back off (see Exponential back off for retrying device actions)
    - Reconnecting
    - Dropping the message
    - Triggering further actions
- Generic server errors
  - The device connectivity layer returns a generic error code indicating a cloud problem.
    Example: HTTP code 500 is returned to a client because a transient internal error occurred during processing.
  - The devices can perform a retry of the operation using an exponential back off (see Exponential back off for retrying device actions).
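
As a rough illustration of the three error groups, the sketch below selects a reaction per group. All names are hypothetical, including the example error code; the actual codes are listed in the documentation of the respective protocol adapter. Falling back to exponential back off for a well-known code without a registered reaction is an assumption of this sketch.

```python
from enum import Enum
from typing import Mapping


class ErrorGroup(Enum):
    CLIENT = "client"
    WELL_KNOWN_SERVER = "well-known-server"
    GENERIC_SERVER = "generic-server"


def choose_reaction(
    group: ErrorGroup,
    error_code: str,
    well_known_reactions: Mapping[str, str],
) -> str:
    """Pick a reaction for an error, following the grouping described above."""
    if group is ErrorGroup.CLIENT:
        # Retrying an unchanged message will not lead to success.
        return "drop"
    if group is ErrorGroup.WELL_KNOWN_SERVER and error_code in well_known_reactions:
        # Code-specific handling registered by the device developer.
        return well_known_reactions[error_code]
    # Generic server errors (and, as an assumption, unregistered codes).
    return "retry-with-backoff"


# Hypothetical code name, used only for illustration.
reactions = {"tenant-message-limit-reached": "retry-after-fixed-time"}
```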
Exponential back off for retrying device actions

Devices and clients connected to the device connectivity layer should implement means to not produce unnecessary load on the cloud components in case of transient failures.

A transient failure could be e.g. a temporary unavailability of the service or the loss of a connection.

In such cases, highly frequent retries are counter-productive and should be avoided.

The most common approach to handling retries in such cases is to implement an exponential back off. Different client libraries implement such a back off, e.g. in the Java ecosystem the following is common: https://resilience4j.readme.io/docs/retry.

In general the device connectivity layer recommends the following retry behavior:

Type: Exponential back off

Initial retry: 200 milliseconds (The pause to wait at minimum after a previously successful action failed.)

Minimum retry value: 200 milliseconds (The pause to wait at least in between retry attempts.)

Maximum retry value: 300 seconds (The maximum pause to wait between retry attempts.)

Depending on the protocol used (HTTP, MQTT, CoAP, etc.), the corresponding protocol specification might contain further recommendations, which should be considered as well.
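
The recommended values can be turned into a delay calculation as in the following Python sketch. It is a minimal illustration, not a reference implementation: the multiplier of 2, the `max_attempts` limit, and the choice of `ConnectionError` as the transient failure are assumptions, and the function names are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

INITIAL_DELAY_S = 0.2  # initial and minimum pause: 200 milliseconds
MAX_DELAY_S = 300.0    # maximum pause: 300 seconds
MULTIPLIER = 2.0       # assumption: the growth factor is not specified above


def backoff_delay(attempt: int) -> float:
    """Return the pause in seconds before retry number `attempt` (0-based)."""
    # Bound the exponent so that very large attempt numbers cannot overflow.
    exponent = min(max(attempt, 0), 32)
    return min(INITIAL_DELAY_S * MULTIPLIER ** exponent, MAX_DELAY_S)


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,  # assumption: upper bound chosen for illustration
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, pausing with exponential back off after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            # ConnectionError stands in for a transient failure in this sketch.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

With these values the pauses grow as 0.2 s, 0.4 s, 0.8 s and so on until they are capped at 300 s. Passing a custom `sleep` function makes the helper easy to test without waiting.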

Client connection drop detection

The device connectivity layer allows clients to connect via various protocols, and each of those protocols has its own means of establishing a connection.

Some protocols might use a new connection for each request (e.g. HTTP without connection reuse), while others can reuse a connection for multiple requests (e.g. MQTT).

For protocols that allow reusing an existing connection, devices must be able to handle connection drops.

The specification of the protocol used generally describes how to detect and handle such cases.

Most commonly a keep alive mechanism is used on the protocol layer.

The protocol adapter specific implementation / configuration notes are documented on the respective protocol adapter documentation page. Refer to MQTT adapter - Inactivity timeout as an example.
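
On the client side, a keep-alive mechanism can be complemented by a simple inactivity check, as in the following Python sketch. The `KeepAliveMonitor` class is hypothetical, and the grace factor of 1.5 is an assumption made for illustration; the actual timeout values are defined by the protocol in use and by the protocol adapter configuration.

```python
import time
from typing import Callable


class KeepAliveMonitor:
    """Hypothetical helper that flags a connection as dropped after silence."""

    def __init__(
        self,
        keep_alive_s: float,
        grace_factor: float = 1.5,  # assumption: tolerate 1.5x the interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = keep_alive_s * grace_factor
        self._clock = clock
        self._last_seen = clock()

    def on_traffic(self) -> None:
        """Record that a packet (data or keep-alive reply) was received."""
        self._last_seen = self._clock()

    def is_dropped(self) -> bool:
        """Return True if nothing was received within the tolerated window."""
        return self._clock() - self._last_seen > self._timeout_s
```

A device would call `on_traffic()` whenever it receives anything from the protocol adapter and trigger its reconnect logic once `is_dropped()` returns `True`.
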
Recommendations by the digital twin layer
Your devices and applications interacting with the digital twin layer need to be able to bridge small interruptions or latencies. On the one hand, this requirement is imposed by the fact that we work with distributed systems. On the other hand, your devices and applications also depend heavily on the network services of other providers and should not break if they cannot keep a connection to our infrastructure, at least for a short time.

To identify cases where your request might not have reached our service, we recommend examining the exception you might receive in response to a request.
The status code returned by the digital twin layer service might help to indicate whether the failure is transient or not. You might need to examine the exceptions generated by the client in use. For example, the Ditto client provides exceptions with HTTP status code semantics, so that you can analyze and interpret the exception.

The following HTTP status codes typically indicate that a re-try is appropriate:
- 408 Request Timeout
- 424 Failed Dependency
- 429 Too Many Requests
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
No matter whether you access our service via the digital twin layer HTTP API, WebSockets, or Ditto protocol, in case of exceptions we provide the same status codes.

All API-based usage of our service should apply a systematic approach to managing retries, including an exponential back off, as well as reconnects.
Ideally, you should write reusable code, so that you can apply a consistent methodology across all clients and all applications.
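
One way to keep the retry decision reusable across clients is a small policy object, as in this Python sketch. The `RetryPolicy` class and its names are hypothetical; only the status codes are taken from the list above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical reusable policy: which status codes count as transient."""

    retryable_status_codes: FrozenSet[int]

    def should_retry(self, status_code: int) -> bool:
        """Return True if a retry is typically appropriate for this status."""
        return status_code in self.retryable_status_codes


# Status codes listed above for the digital twin layer.
DIGITAL_TWIN_POLICY = RetryPolicy(
    frozenset({408, 424, 429, 500, 502, 503, 504})
)
```

Because the set of codes is a constructor argument, the same class can be instantiated with a different list for another service.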

Recommendations by the mass device management layer

Your applications interacting with Bosch IoT Manager should be able to bridge small interruptions or latencies, for example during near-zero downtime deployments or load restrictions. On the one hand, this requirement is imposed by the fact that we work with distributed systems, and in a distributed system an endpoint might become briefly unavailable due to networking issues, redeployments, restarts, reassignment of resources etc. On the other hand, your application also depends heavily on the network services of other providers and should not break if it cannot keep a connection to our infrastructure, at least for a short time.

To identify cases where your request might not have reached our service, we recommend examining the exception you might receive in response to a request.

The status code returned by the Bosch IoT Manager service might help to indicate whether the failure is transient or not. You might need to examine the exceptions generated by the client in use.

REST API responses

When using the REST API of Bosch IoT Manager, you may encounter the responses listed below. Consider a retry accordingly.

Response	Description
404	Resource not found. This status code generally indicates that there is no such resource. However, a retry could be useful because the resource might exist while the system is rebooting and the REST API is temporarily unavailable.
408	Request timeout. As this status code indicates a request timeout, a retry may be useful.
429	Too many requests. This status code indicates that a limit has been hit. A retry may be useful if a place in the queue has been freed in the meantime. See the exponential back off section below for the recommended intervals.
500	Internal server error. In some cases, though not always, this status code indicates service downtime. Therefore a retry might be helpful.
502	Bad gateway. This status code indicates that the server, while acting as a gateway or proxy, received an invalid response from the upstream server. Therefore a retry might be helpful.
503	Service unavailable. This status code indicates that the service is temporarily unavailable. Therefore a retry might be helpful.
504	Gateway timeout. This status code indicates that the server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request. Therefore a retry might be helpful.

Java API responses

When using the Java API of Bosch IoT Manager (via gRPC), you should build your retry/reconnect logic around the following responses, which generally fall into two groups.

Reconnect

It may be useful to perform an auto-reconnect if you receive the following gRPC response:

Response	Description
`UNAVAILABLE`	The service is currently unavailable. This is most likely a transient condition and may be corrected by retrying with a back off. Note that it is not always safe to retry non-idempotent operations.

Retry

It may be useful to retry sending the same request if you receive any of the following Manager responses:

Response	Description
`INTERNAL_ERROR`	The server encountered an unexpected condition that prevented it from fulfilling the request.
`TIMEOUT`	The server did not receive a complete request message within the time that it was prepared to wait.
`TOO_MANY_REQUESTS`	The user has sent too many requests in a given amount of time ("rate limiting").
`TEMPORARY_UNAVAILABLE`	The server is temporarily unavailable.

It may be useful to make some changes to your request and then retry it, if you receive the following response:

Response	Description
`EXPIRED`	The authentication credentials expired for the target resource. You can retry but first should reconnect with an updated authentication token.
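
The three tables above can be summarized as a lookup from response name to reaction. The Python sketch below treats the responses as plain strings for illustration; the function name and the returned labels are hypothetical and do not mirror the types of the actual Java API.

```python
RECONNECT_RESPONSES = frozenset({"UNAVAILABLE"})
RETRY_RESPONSES = frozenset(
    {"INTERNAL_ERROR", "TIMEOUT", "TOO_MANY_REQUESTS", "TEMPORARY_UNAVAILABLE"}
)
REFRESH_AND_RETRY_RESPONSES = frozenset({"EXPIRED"})


def reaction_for(response: str) -> str:
    """Map a response name to the reaction suggested in the tables above."""
    if response in RECONNECT_RESPONSES:
        return "reconnect"
    if response in RETRY_RESPONSES:
        return "retry"
    if response in REFRESH_AND_RETRY_RESPONSES:
        # Reconnect with an updated authentication token, then retry.
        return "refresh-credentials-and-retry"
    # Anything else is not listed as retryable in this document.
    return "fail"
```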

Exponential back off for retrying calls to the Bosch IoT Manager APIs

Applications connected to Bosch IoT Manager should implement means to not produce unnecessary load on the cloud components in case of transient failures. A transient failure could be e.g. a temporary unavailability of the service or loss of connection. In such cases highly frequent retries are counter-productive and should be avoided.

The most common approach to handling retries in such cases is to implement an exponential back off. Different client libraries implement such a back off, e.g. in the Java ecosystem the following is common: https://resilience4j.readme.io/docs/retry.

Consider the following recommended intervals:

Initial retry: 500 milliseconds (The pause to wait at minimum after a previously successful action failed.)

Minimum retry value: 500 milliseconds (The pause to wait at least in between retry attempts, which can be doubled on each attempt.)

Recommendations by the software update layer
Your devices and applications interacting with Bosch IoT Rollouts need to be able to bridge small interruptions or latencies. On the one hand, this requirement is imposed by the fact that we work with distributed systems, where short interruptions can result from a networking issue, a maintenance activity, or a service incident. On the other hand, your devices and applications also depend heavily on the network services of other providers and should not break if they cannot keep a connection to our infrastructure, at least for a short time.

To identify cases where your request might not have reached our service, we recommend examining the exception you might receive in response to a request. The status code returned by the Bosch IoT Rollouts service might help to indicate whether the failure is transient or not.

The following HTTP status codes typically indicate that a re-try is appropriate:
- 409 Conflict
- 429 Too Many Requests
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
No matter whether you access our service via the Bosch IoT Rollouts Management API or DDI API, in case of exceptions we provide the same status codes.

All usage of our APIs should apply a systematic approach to managing retries, including an exponential back off, as well as reconnects. This helps to avoid unnecessary load that can become counterproductive.
