High Availability Design

This document provides guidelines for building mission-critical, end-to-end services using the capabilities of the iotcomms.io platform. A service's availability is only as strong as its weakest link. In real-world deployments, unexpected failures must be accounted for to maintain service reliability. A key principle in designing resilient services is redundancy: multiple redundant nodes should be available, and if communication with one node fails, the system should automatically retry with another node.

Platform Resilience

To ensure a reliable service, the iotcomms.io platform is built on multiple nodes hosting different services. Each service runs on several active nodes deployed across multiple geographic regions. If one region experiences an outage, traffic is automatically routed to nodes in another region.

The platform continuously monitors node utilization and dynamically scales capacity as needed. If a node fails, it is automatically detected, removed from service, and replaced with a new one, ensuring seamless operation without manual intervention.

Service Discovery

DNS plays a fundamental role in routing traffic to active service nodes. The iotcomms.io platform publishes active nodes in public DNS, automatically updating records when nodes are added or removed.

For SIP-based services, service discovery follows the mechanisms outlined in RFC 3263. This standard defines how SIP clients use DNS to locate SIP servers and retry another available server if a request fails.
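As an illustration of the server selection step, RFC 3263 relies on DNS SRV records, whose ordering rules (lowest priority first, weighted random choice within a priority) come from RFC 2782. The sketch below orders a list of already-resolved SRV records in the sequence a client would try them. The `SrvRecord` type, the function name, and the host names are hypothetical, and the actual DNS lookup is left out; this is a simplified sketch, not a complete RFC 3263 implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int   # lower values are tried first
    weight: int     # relative share within the same priority
    port: int
    target: str


def order_srv_records(records, rng=None):
    """Return SRV records in the order a client should try them."""
    rng = rng or random.Random()
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        # RFC 2782 places zero-weight records at the start of the group.
        group.sort(key=lambda r: r.weight != 0)
        while group:
            total = sum(r.weight for r in group)
            threshold = rng.randint(0, total)
            running = 0
            for record in group:
                running += record.weight
                if running >= threshold:
                    break
            group.remove(record)
            ordered.append(record)
    return ordered


# Hypothetical records; real ones come from the SRV lookup for the SIP domain.
records = [
    SrvRecord(20, 0, 5061, "sip-b.example.invalid"),
    SrvRecord(10, 60, 5061, "sip-a1.example.invalid"),
    SrvRecord(10, 40, 5061, "sip-a2.example.invalid"),
]
for record in order_srv_records(records):
    print(record.priority, record.target)
```

A client that walks this list and moves on to the next entry when a request fails gets the failover behavior the standard describes.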

Similarly, the iotcomms.io platform leverages internal DNS for routing traffic between its internal nodes. Clients capable of resolving multiple DNS entries and implementing retry mechanisms can fully benefit from the platform's high availability design. Clients that do not support this functionality may experience service degradation, and such failures are not counted against the SLA.
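A client that resolves multiple DNS entries and retries could look roughly like the following sketch. It resolves every address for a host name and tries them in turn until one accepts a TCP connection. The function name and the `resolve` and `connect` parameters are assumptions made for illustration (they default to the standard library calls and can be swapped out for testing); the host name used by a real client would be the service endpoint from the platform documentation.

```python
import socket


def connect_with_failover(host, port, timeout=3.0,
                          resolve=socket.getaddrinfo,
                          connect=socket.create_connection):
    """Try each resolved address for host until one connects."""
    last_error = None
    # getaddrinfo returns one entry per resolved address (IPv4 and IPv6).
    for _family, _type, _proto, _name, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (ip, port) for both IPv4 and IPv6 entries.
            return connect(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # remember the failure, move to the next node
    raise ConnectionError(
        f"no reachable address for {host}:{port}") from last_error
```

Resolving again on each new connection attempt, rather than reusing an old address list, keeps the client aligned with the records the platform currently publishes.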

DNS Caching and TTL Considerations

Clients should honor the Time-To-Live (TTL) of each DNS record and re-resolve once it expires, rather than caching addresses for longer. A client that holds on to stale records may keep connecting to nodes that have been removed from service, leading to connectivity issues. Re-resolving on TTL expiry ensures that clients connect to the current set of active service nodes.
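One way to honor TTLs in an application-level cache is sketched below. The class and its `lookup` callable are hypothetical: `lookup(name)` is assumed to return the resolved addresses together with the record TTL in seconds, which typically requires a DNS library, since the standard socket resolver does not expose TTL values.

```python
import time


class TtlResolverCache:
    """Cache DNS answers, but never longer than the TTL of the record."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup    # assumed: name -> (addresses, ttl_seconds)
        self._clock = clock
        self._entries = {}       # name -> (addresses, expiry time)

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]      # still within TTL, reuse cached answer
        addresses, ttl = self._lookup(name)
        self._entries[name] = (addresses, now + ttl)
        return addresses
```

Long-lived processes and connection pools are the usual place where addresses outlive their TTL, so it is worth checking that they re-resolve as well.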

API Call Resiliency

This section outlines best practices for building a resilient service when interacting with iotcomms.io APIs.

API Requests

If an API request to the iotcomms.io platform fails or times out, the client should retry the request up to five times, using an exponential backoff strategy. The one exception is a 404 Not Found response, which indicates an invalid URL rather than a transient failure and should not be retried.
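The retry rule can be sketched as follows. The function name, the `send` callable (assumed to perform one request and return the HTTP status code, or raise `OSError` on a network failure or timeout), and the delay values are assumptions chosen for illustration; the platform documentation does not prescribe specific delays.

```python
import time


class RequestFailed(Exception):
    """Raised when a request is not retried or all retries are used up."""


def call_with_retries(send, max_retries=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() once, then retry up to max_retries times with backoff."""
    last_failure = None
    for attempt in range(max_retries + 1):
        try:
            status = send()
        except OSError as exc:      # includes TimeoutError
            last_failure = repr(exc)
        else:
            if 200 <= status < 300:
                return status
            if status == 404:
                # Invalid URL, not a transient failure: do not retry.
                raise RequestFailed("404 Not Found, request not retried")
            last_failure = f"HTTP {status}"
        if attempt < max_retries:
            # Delay doubles on each retry: 0.5 s, 1 s, 2 s, 4 s, 8 s.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RequestFailed(f"gave up after {max_retries} retries: {last_failure}")
```

Adding a small random jitter to each delay is a common refinement when many clients may retry at the same moment; it is left out here to keep the sketch short.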

It is important that any system or device connecting to the iotcomms.io services implements retry functionality when a request fails. This ensures that temporary network issues or transient failures do not disrupt service availability.

Callbacks

Callbacks sent from the iotcomms.io platform are automatically retransmitted in the event of an error response or connection timeout, except in cases where a 404 Not Found status is received.

Monitoring and Logging

Effective monitoring and logging are crucial for maintaining high availability. Customers should continuously monitor their application logs for API failures, service alarms, and other unexpected behavior. This proactive approach helps detect and mitigate potential issues before they impact service reliability.

Additionally, real-time monitoring of API requests, device registrations, and callback success rates provides valuable insights into the health of the service. Logging mechanisms should be designed to capture detailed error information, allowing for efficient troubleshooting and resolution of failures.
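As a small illustration of tracking success rates, the sketch below keeps a sliding window of recent outcomes (API requests or callbacks, for example) and flags when the success rate drops below a threshold. The class name, window size, and threshold are assumptions for illustration; in practice this role is usually filled by an existing metrics or alerting system.

```python
from collections import deque


class SuccessRateMonitor:
    """Track the success rate over the most recent outcomes."""

    def __init__(self, window=100, threshold=0.95):
        self._outcomes = deque(maxlen=window)  # oldest entries drop off
        self._threshold = threshold

    def record(self, ok):
        self._outcomes.append(bool(ok))

    def success_rate(self):
        if not self._outcomes:
            return 1.0  # no data yet, treat as healthy
        return sum(self._outcomes) / len(self._outcomes)

    def is_degraded(self):
        return self.success_rate() < self._threshold
```

Calling `record()` after every request and logging the error details whenever `is_degraded()` turns true gives an early signal before failures accumulate.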

Alarmbridge High Availability

This section provides design recommendations for achieving high availability when using the Alarmbridge capabilities of the platform. Alarmbridge is typically used to connect alarm devices with alarm receiving platforms.

Alarmbridge implements the retransmission logic required by the protocol in use. For example, alarm events that are not acknowledged will be retransmitted based on protocol-specific rules.

Alarm Device Behavior

Alarm devices should implement their own retry mechanisms in cases where alarm transmissions are not acknowledged. By performing retries, the device can eventually establish a successful communication path through available nodes in the platform.

Devices using SIP-based alarm protocols, such as SCAIP, must use the DNS mechanisms described under Service Discovery to resolve SIP servers and fail over to another server when one does not respond.

SCAIP to Analog Voice Bridge Calls

When bridging from SCAIP to analog protocols using Alarmbridge, the following steps occur:

  1. The alarm device sends a SCAIP request to the service. If acknowledged, the device is instructed to place a voice call to a designated Voicebridge number. If the request is not acknowledged, the device should retry using another resolved SIP server for the domain.

  2. The alarm device connects to the Voicebridge number, which plays a ringback tone until the analog event is acknowledged. If the alarm transmission fails, the call is disconnected. In such cases, the alarm device should restart the process from step 1.
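The two steps above can be sketched from the alarm device's point of view. All names here are hypothetical: `send_scaip(server)` is assumed to return the Voicebridge number when the request is acknowledged and `None` otherwise, and `place_voice_call(number)` is assumed to return `True` when the analog event is acknowledged and `False` when the call is disconnected. The limit on the number of rounds is also an assumption, not a platform requirement.

```python
def send_alarm(sip_servers, send_scaip, place_voice_call, max_rounds=3):
    """Run the SCAIP-to-voice flow; return True once the alarm is delivered."""
    for _ in range(max_rounds):
        # Step 1: try each resolved SIP server until one acknowledges.
        number = None
        for server in sip_servers:
            number = send_scaip(server)
            if number is not None:
                break
        if number is None:
            continue  # no server acknowledged, start a new round
        # Step 2: call the Voicebridge number and wait for acknowledgement.
        if place_voice_call(number):
            return True
        # Call was disconnected: restart from step 1.
    return False
```

In a real device, `sip_servers` would be refreshed from DNS before each round so that newly published nodes are picked up.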

Service Alarms

Unexpected failures, such as failed API callback requests or other unforeseen disruptions, can trigger an API callback alarm. Customers are strongly encouraged to monitor and handle these alarms proactively, as they may indicate misconfigurations or integration issues requiring attention.

By implementing these guidelines, services built on the iotcomms.io platform can achieve high availability, ensuring reliability and resilience in mission-critical applications.