General retry and reconnect guidelines
Your applications interacting with Bosch IoT Manager should be able to bridge small interruptions or latencies, for example during near-zero downtime deployments or load restrictions. On the one hand, this requirement follows from the fact that we work with distributed systems, and in a distributed system an endpoint might become briefly unavailable due to networking issues, redeployments, restarts, reassignment of resources, etc. On the other hand, your application also depends heavily on the network services of other providers and should not break if they are unable to keep connections to our infrastructure, at least for a short time.
In order to identify cases where your request might not have reached our service, we recommend examining the exception you receive for the request.
The status code returned by the Bosch IoT Manager service can help to indicate whether the failure is transient or not. You might also need to examine the exceptions generated by the client in use.
REST API responses
When using the REST API of Bosch IoT Manager, you may encounter the responses listed below; consider retrying accordingly.
| Response | Description |
| --- | --- |
| 404 | Resource not found. This status code generally indicates that the resource does not exist. However, a retry could still be useful: the resource might exist while the system is rebooting and the REST endpoint is temporarily unavailable, which could be the actual reason for the error. |
| 408 | Request timeout. As this status code indicates a request timeout, a retry may be useful. |
| 429 | Too many requests. This status code indicates that a rate limit has been hit, therefore a retry may be useful once a place in the queue has been vacated in the meantime. Check the exponential back off section below for the recommended intervals. |
| 500 | Internal server error. Not always, but in some cases this status code might indicate service downtime. Therefore a retry might be helpful. |
| 502 | Bad gateway. This status code indicates that the server, while acting as a gateway or proxy, received an invalid response from the upstream server. Therefore a retry might be helpful. |
| 503 | Service unavailable. This status code indicates that the service is temporarily unavailable. Therefore a retry might be helpful. |
| 504 | Gateway timeout. This status code indicates that the server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request. Therefore a retry might be helpful. |
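As a rough illustration, the retryable status codes from the table above could be collected into a small helper. This is a minimal sketch in plain Java; the class and method names are our own and not part of any Bosch IoT Manager client library.

```java
import java.util.Set;

public class RetryableStatus {

    // HTTP status codes from the table above for which a retry may be useful
    private static final Set<Integer> RETRYABLE =
            Set.of(404, 408, 429, 500, 502, 503, 504);

    /** Returns true if the given HTTP status code may be worth retrying. */
    public static boolean isRetryable(int statusCode) {
        return RETRYABLE.contains(statusCode);
    }
}
```

Whether a 404 is actually worth retrying depends on your use case: if the resource is known not to exist, retrying only adds load.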
Java API responses
When using the Java API of Bosch IoT Manager (via gRPC), you should build your retry/reconnect logic around the following responses, which fall into two groups.
Reconnect
It may be useful to perform an auto-reconnect if you receive the following gRPC response:
| Response | Description |
| --- | --- |
| UNAVAILABLE | The service is currently unavailable. This is most likely a transient condition and may be corrected by retrying with a backoff. Note that it is not always safe to retry non-idempotent operations. |
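An auto-reconnect loop with a doubling pause could look roughly like the sketch below. The `connectWithRetry` helper and its signature are our own invention; in a real client you would typically only retry when the caught exception actually carries the UNAVAILABLE code, and only for idempotent operations.

```java
import java.util.concurrent.Callable;

public class Reconnector {

    /**
     * Calls {@code connect} until it succeeds or {@code maxAttempts} is reached,
     * doubling the pause between attempts (500 ms, 1 s, 2 s, ...).
     */
    public static <T> T connectWithRetry(Callable<T> connect, int maxAttempts)
            throws Exception {
        long pauseMs = 500;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) { // e.g. a gRPC UNAVAILABLE surfaced by the client
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMs);
                    pauseMs *= 2;
                }
            }
        }
        throw last;
    }
}
```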
Retry
It may be useful to retry sending the same request if you receive any of the following Manager responses:
| Response | Description |
| --- | --- |
| INTERNAL_ERROR | The server encountered an unexpected condition that prevented it from fulfilling the request. |
| TIMEOUT | The server did not receive a complete request message within the time that it was prepared to wait. |
| TOO_MANY_REQUESTS | The user has sent too many requests in a given amount of time ("rate limiting"). |
| TEMPORARY_UNAVAILABLE | The server is temporarily unavailable. |
It may be useful to make some changes to your request and then retry it, if you receive the following response:
| Response | Description |
| --- | --- |
| EXPIRED | The authentication credentials for the target resource have expired. You can retry, but first you should reconnect with an updated authentication token. |
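The response tables above can be summarized in a single classification. This is an illustrative sketch; the enum and method below are our own naming, not part of the Java API.

```java
public class ManagerRetryPolicy {

    public enum Action { RECONNECT, RETRY, REFRESH_TOKEN_THEN_RETRY, FAIL }

    /** Maps a gRPC/Manager response code (as listed above) to a suggested action. */
    public static Action classify(String responseCode) {
        switch (responseCode) {
            case "UNAVAILABLE":
                return Action.RECONNECT;
            case "INTERNAL_ERROR":
            case "TIMEOUT":
            case "TOO_MANY_REQUESTS":
            case "TEMPORARY_UNAVAILABLE":
                return Action.RETRY;
            case "EXPIRED":
                return Action.REFRESH_TOKEN_THEN_RETRY;
            default:
                return Action.FAIL; // do not blindly retry unknown errors
        }
    }
}
```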
Exponential back off for retrying calls to the Bosch IoT Manager APIs
Applications connected to Bosch IoT Manager should avoid producing unnecessary load on the cloud components in case of transient failures, such as a temporary unavailability of the service or a loss of connection. In such cases, highly frequent retries are counter-productive and should be avoided.
The most common approach to handling retries in such cases is to implement an exponential back off. Various client libraries implement such a back off; in the Java ecosystem, for example, the following is common: https://resilience4j.readme.io/docs/retry.
Consider the following recommended intervals:

- Initial retry: 500 milliseconds (the minimum pause to wait after a previously successful action failed)
- Minimum retry value: 500 milliseconds (the minimum pause to wait between retry attempts, which can be doubled on each attempt)
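With the recommended values above, the pause before the n-th retry attempt could be computed as follows. This is a sketch; the cap of 60 seconds is our own assumption to keep pauses bounded, not a documented limit.

```java
public class ExponentialBackoff {

    private static final long INITIAL_MS = 500;  // recommended initial/minimum pause
    private static final long MAX_MS = 60_000;   // assumed upper bound (not documented)

    /** Pause in milliseconds before retry attempt n (n = 1, 2, 3, ...): 500, 1000, 2000, ... */
    public static long delayMillis(int attempt) {
        long delay = INITIAL_MS << Math.min(attempt - 1, 30); // bound the shift to avoid overflow
        return Math.min(delay, MAX_MS);
    }
}
```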