Retry Guide for devices

This guide contains best practices to be considered when developing IoT devices (clients) to be connected to the Bosch IoT Hub. It is thus aimed at device implementers.

In a distributed system, an endpoint might become briefly unavailable due to networking issues, redeployments, restarts, reassignment of resources, etc. The IoT device that connects to an endpoint must therefore be able to cope with this behavior.

Furthermore, without additional measures a message might get lost before it reaches the intended receiver. The Bosch IoT Hub offers different mechanisms that help to solve this issue, and they must be used in the intended way.

Reconnect to an endpoint

The following requirements apply to an IoT device:

Note: For IoT devices a minimum of three connection attempts within at least three minutes needs to be considered as reconnection time and not as downtime.

Retry sending a message

The device interacts with Bosch IoT Hub using the following messaging patterns, each of which provides different delivery guarantees to the device.

  • Telemetry messages

    • QoS 0

      It is guaranteed by Bosch IoT Hub that a high throughput is reached but any message might get lost. To handle these circumstances the IoT devices must consider three cases:

      • Successful response from protocol adapter indicates that an attempt was made to forward the message. The message might however still get lost later in the delivery process. The application must therefore take this into account when interpreting the status of an IoT device, etc.

      • Error response from protocol adapter indicates that the message was received by the protocol adapter but could not be forwarded, either due to an internal system error or because no application is currently connected to the AMQP interface. In this case the IoT device should, in the sense of QoS 0, not perform a retry. QoS 0 is assumed to be used for a very high volume of messages, so a retry policy may cause message congestion.

      • No response (timeout) from protocol adapter indicates that the connection from the IoT device to the protocol adapter might be interrupted. See Reconnect to an endpoint.

    • QoS 1

      It is guaranteed by Bosch IoT Hub that a medium throughput is reached and any loss of messages within the Bosch IoT Hub system will be reported.

      • Successful response from protocol adapter indicates that the message was forwarded and accepted by the receiving application. This does not, however, guarantee that the message has actually been processed by the receiving application, which must be considered by the IoT device.

      • Error response from protocol adapter indicates that the message was received by the protocol adapter but could not be forwarded. Possible causes are an internal system error, no application currently being connected to the AMQP interface, or a connected application not acknowledging received messages. In the error response case the IoT device should perform a retry with a back off mechanism depending on the received error signal. See Reaction of received error status.

      • No response (timeout) from protocol adapter indicates that the connection from the IoT device to the protocol adapter might be interrupted. See Reconnect to an endpoint.

  • Event messages

    Bosch IoT Hub implements different technical means to ensure the delivery of event messages to their consumers as far as possible. For example, an event message is persisted in different availability zones to lower the risk of data loss. This allows event messages to be delivered to their consumers even in edge cases such as application crashes. The trade-off is that the achievable throughput might be lower than for QoS 1 telemetry.

    • Successful response from protocol adapter indicates that the message was persisted within the Bosch IoT Hub and is available to be fetched by the receiving application. This does not, however, guarantee that the message is actually processed by the receiving application, which must be considered by the IoT device. (For more information on processing acknowledgments of a message in the IoT Suite context, see Bosch IoT Things Acknowledgements.)

    • Error response from protocol adapter indicates that the message was received by the protocol adapter but could not be persisted, either due to an internal system error or, more specifically, because the message queue limit is exceeded, which can happen if no application is currently connected to the AMQP interface or a connected application does not acknowledge received messages. In the error response case the IoT device should perform a retry with a back off mechanism depending on the received error signal. See Reaction of received error status.

    • No response (timeout) from protocol adapter indicates that the connection from the IoT device to the protocol adapter might be interrupted. See Reconnect to an endpoint.

  • Command response messages

    It is guaranteed by Bosch IoT Hub that any loss of command responses within the Bosch IoT Hub system will be reported.

    • Successful response from protocol adapter indicates that the command response was forwarded and accepted by the receiving application. This does not, however, guarantee that the command response has actually been processed by the receiving application, which must be considered by the IoT device.

    • Error response from protocol adapter indicates that the command response was received by the protocol adapter but could not be forwarded. Possible causes are an internal system error, no application currently being connected to the AMQP interface, or a connected application not acknowledging received command responses. In the error response case the IoT device should perform a retry with a back off mechanism depending on the received error signal. See Reaction of received error status.

    • No response (timeout) from protocol adapter indicates that the connection from the IoT device to the protocol adapter might be interrupted. See Reconnect to an endpoint.

Note: If the repeated sending of a device message is not possible for a period longer than one minute, then this time is considered as downtime.
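The cases above can be condensed into a small decision table. The following is a minimal sketch only: the names Pattern, Outcome, Action and next_action are hypothetical and not part of any Bosch IoT Hub client library. For error responses it returns the default reaction; whether a retry is actually sensible still depends on the received error code, as described in Reaction of received error status.

```python
from enum import Enum


class Pattern(Enum):
    TELEMETRY_QOS0 = "telemetry-qos0"
    TELEMETRY_QOS1 = "telemetry-qos1"
    EVENT = "event"
    COMMAND_RESPONSE = "command-response"


class Outcome(Enum):
    SUCCESS = "success"   # protocol adapter answered with a success response
    ERROR = "error"       # protocol adapter answered with an error response
    TIMEOUT = "timeout"   # no response from the protocol adapter


class Action(Enum):
    DONE = "done"                  # nothing more to do for this message
    DROP = "drop"                  # give up on this message, do not retry
    RETRY_WITH_BACKOFF = "retry"   # resend after a back off pause
    RECONNECT = "reconnect"        # connection may be interrupted, reconnect first


def next_action(pattern: Pattern, outcome: Outcome) -> Action:
    """Map a send outcome to the device reaction described in this guide."""
    if outcome is Outcome.SUCCESS:
        return Action.DONE
    if outcome is Outcome.TIMEOUT:
        # Same reaction for every pattern: see "Reconnect to an endpoint".
        return Action.RECONNECT
    # Error response: QoS 0 telemetry is not retried to avoid congestion.
    if pattern is Pattern.TELEMETRY_QOS0:
        return Action.DROP
    return Action.RETRY_WITH_BACKOFF
```

A device would call next_action after every send attempt and refine the RETRY_WITH_BACKOFF case with the error code it received.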

Retry subscribing for commands

To receive commands, an IoT device needs to subscribe for them. How this is done is protocol specific.

If the subscription fails (e.g. timeout or error response due to internal system error) the IoT device must implement a retry behavior with a back off mechanism depending on the received error signal. See Reaction of received error status.

Note: If the subscription of a device for command messages is not possible for a period longer than one minute, then this time is considered as downtime.

Reaction of received error status

In case of an error, Bosch IoT Hub generally returns an error code indicating the problem that occurred.

The number and granularity of available error codes depend on the protocol used.

Some protocols allow more verbose signaling of the problem (e.g. AMQP 1.0), while others are simpler and therefore cannot convey as much information in case of errors (e.g. MQTT v3).

Bosch IoT Hub error codes can be assigned to one of the following groups:

  • Client errors

    • An error code indicating something with the request itself is wrong.

      Example: The maximum allowed message length is exceeded.

    • A retry is in general not recommended, as a retry with an unchanged message will not lead to a success.

  • Well-known server errors

    • Bosch IoT Hub returns a well-known / defined error code indicating a specific problem.

      Example: The message limit for the current invoice period of the tenant is reached.

    • The client should implement proper handling of all of the available well-known error codes.

      The list of available error codes is provided in the documentation of the respective protocol adapter.

    • Depending on the error situation this handling can include (not limited):

  • Generic server errors

    • Bosch IoT Hub returns a generic error code indicating a cloud problem.

      Example: HTTP code 500 is returned to a client, as a transient internal error occurred during processing.

    • The devices can perform a retry of the operation using an exponential back off (see Exponential back off for retrying device actions).
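The three groups above can be sketched as a small classifier. This example assumes HTTP status codes as the error signal. The set EXAMPLE_WELL_KNOWN_CODES is a hypothetical placeholder and must be replaced with the codes listed in the documentation of the respective protocol adapter. All names in the sketch (ErrorGroup, Handling, classify, handling_for) are illustrative and not part of any Bosch IoT Hub API.

```python
from enum import Enum


class ErrorGroup(Enum):
    CLIENT = "client"
    WELL_KNOWN_SERVER = "well-known-server"
    GENERIC_SERVER = "generic-server"


class Handling(Enum):
    NO_RETRY = "no-retry"            # an unchanged message will fail again
    CODE_SPECIFIC = "code-specific"  # handle as documented for this code
    BACKOFF_RETRY = "backoff-retry"  # retry with exponential back off


# Hypothetical placeholder: take the real list from the adapter documentation.
EXAMPLE_WELL_KNOWN_CODES = frozenset({429, 503})


def classify(status: int, well_known=EXAMPLE_WELL_KNOWN_CODES) -> ErrorGroup:
    """Assign an HTTP error status to one of the three groups."""
    if status in well_known:
        return ErrorGroup.WELL_KNOWN_SERVER
    if 400 <= status < 500:
        return ErrorGroup.CLIENT
    if 500 <= status < 600:
        return ErrorGroup.GENERIC_SERVER
    raise ValueError(f"not an error status: {status}")


def handling_for(group: ErrorGroup) -> Handling:
    """Return the general reaction recommended for an error group."""
    return {
        ErrorGroup.CLIENT: Handling.NO_RETRY,
        ErrorGroup.WELL_KNOWN_SERVER: Handling.CODE_SPECIFIC,
        ErrorGroup.GENERIC_SERVER: Handling.BACKOFF_RETRY,
    }[group]
```

Well-known codes are checked first so that a documented code is never treated as a generic client or server error.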

Exponential back off for retrying device actions

Devices and clients connected to Bosch IoT Hub should implement means to not produce unnecessary load on the cloud components in case of transient failures.

A transient failure could be, e.g., a temporary unavailability of the service or the loss of a connection.

In such cases, high-frequency retries are counter-productive and should be avoided.

The most common approach to handle retries in such cases is to implement an exponential back off. Different client libraries implement such a back off; in the Java ecosystem, for example, the following is common: https://resilience4j.readme.io/docs/retry.

In general Bosch IoT Hub recommends the following retry behavior:

Type: Exponential back off

Initial retry: 200 milliseconds (The pause to wait at minimum after a previously successful action failed.)

Minimum retry value: 200 milliseconds (The pause to wait at least in between retry attempts.)

Maximum retry value: 300 seconds (The maximum pause to wait between retry attempts.)

Depending on the protocol used (HTTP, MQTT, CoAP, etc.), the corresponding protocol specification might contain further recommendations, which should be considered as well.
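The recommended values can be turned into a delay schedule as in the following minimal sketch. The growth factor of 2 is an assumption, since the guide only prescribes the initial, minimum and maximum pause. The helper names backoff_delays and retry are hypothetical, and a real device would retry only on errors it considers transient instead of catching every exception.

```python
import time

INITIAL_DELAY_S = 0.2   # 200 milliseconds, initial and minimum pause
MAX_DELAY_S = 300.0     # 300 seconds, maximum pause
FACTOR = 2.0            # assumption: the guide does not prescribe a factor


def backoff_delays(initial=INITIAL_DELAY_S, maximum=MAX_DELAY_S, factor=FACTOR):
    """Yield an endless sequence of pauses that grows by `factor`
    and is capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def retry(operation, max_attempts, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Pauses between attempts follow backoff_delays(). The last exception
    is re-raised once all attempts are used up.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

With these values the pauses are 0.2 s, 0.4 s, 0.8 s and so on, reaching the 300 second cap on the twelfth retry. The sleep parameter is injectable so that the loop can be tested without waiting.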

Client connection drop detection

Bosch IoT Hub allows clients to connect via various protocols, and each of those protocols has its own means of establishing a connection.

Some protocols might use a new connection for each request (e.g. HTTP without connection reuse), while others can reuse a connection for multiple requests (e.g. MQTT).

With protocols that allow reusing an existing connection, devices must be able to handle connection drops.

The specification of the protocol used generally describes how to detect and handle such cases.

Most commonly a keep alive mechanism is used on the protocol layer.

The protocol adapter specific implementation / configuration notes are documented on the respective protocol adapter documentation page. See Inactivity Timeout in MQTT Adapter as an example.
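A keep alive based drop detection on the device side could look like the following minimal sketch. The class KeepAliveMonitor and its grace factor of 1.5 are assumptions for illustration. The actual timeouts depend on the protocol and on the protocol adapter configuration, and most client libraries already provide such a mechanism.

```python
import time


class KeepAliveMonitor:
    """Tracks the time since the last packet from the peer and reports
    the connection as lost once the keep alive interval, multiplied by
    a grace factor, has passed without traffic."""

    def __init__(self, keep_alive_s, grace_factor=1.5, clock=time.monotonic):
        self._limit = keep_alive_s * grace_factor
        self._clock = clock          # injectable for testing
        self._last_seen = clock()

    def on_packet_received(self):
        """Call whenever any packet (e.g. a ping response) arrives."""
        self._last_seen = self._clock()

    def connection_lost(self):
        """Return True if the peer has been silent for too long."""
        return self._clock() - self._last_seen > self._limit
```

A device would check connection_lost() periodically and, once it returns True, close the connection and start the reconnect procedure with a back off.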