High availability & ARP
At my day job, I develop safety-related systems on Linux that are expected to provide very high levels of uptime and fault tolerance. The systems guys have a top-level requirement that everything must be dual redundant, and so the word has come down from on high that our corner of the world must comply. We’ve been given three main requirements as a result:
- We will provide our machines in pairs
- Each pair will share an ‘active’ IP address
- Failovers between the machines should be seamless
The first and final points are fairly standard, but the shared address one is new to me. In the previous dual redundant system I worked on, every component communicating with the pair was aware of its dual nature and handled the redundancy itself, with its own special considerations at the application layer.
Putting aside the standard set of high availability complexities (synchronisation, arbitration, etc.), we are left with the data link layer interactions required to quickly move an IP address from one machine to another without service interruption. I should mention that we provide UDP interfaces to all of the systems we interact with, so we don’t have to worry about reestablishing connections at switchover. Because our overall system design is quite focussed on robustness, we can also happily accept the loss of a datagram or two on any interface without being deemed unavailable.
Down to the nitty gritty. When we decide to do a switchover, we are swapping the active and dormant IP addresses between our two machines. For each machine:
- Assign the new IP address to the sub interface
- Enter an IP filtering rule blocking all traffic originating at the old address
- Broadcast a gratuitous ARP with our new address
- Rebind all of our sockets
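To make the gratuitous ARP step concrete, here is a minimal Python sketch of what such a frame looks like on the wire. It is an illustration rather than our production code: the function names `build_gratuitous_arp` and `send_gratuitous_arp`, the interface name and the addresses are all made up for the example. It assumes Ethernet and IPv4, and uses the ARP request form, in which the sender and target protocol addresses are both the address being announced.

```python
import socket
import struct

ETH_P_ARP = 0x0806            # EtherType for ARP
BROADCAST_MAC = b"\xff" * 6   # Ethernet broadcast address


def build_gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request.

    Hypothetical helper for illustration; `mac` is the 6-byte hardware
    address of the announcing interface and `ip` is the IPv4 address
    that has just been assigned to it.
    """
    if len(mac) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: destination, source, EtherType
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, mac, ETH_P_ARP)

    # ARP payload (28 bytes for Ethernet/IPv4)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,             # hardware type: Ethernet
        0x0800,        # protocol type: IPv4
        6,             # hardware address length
        4,             # protocol address length
        1,             # opcode: request
        mac,           # sender hardware address
        ip_bytes,      # sender protocol address (the announced IP)
        b"\x00" * 6,   # target hardware address (unused in a request)
        ip_bytes,      # target protocol address (same as the sender's)
    )
    return eth_header + arp_payload


def send_gratuitous_arp(ifname: str, mac: bytes, ip: str) -> None:
    """Send the frame on `ifname`. Linux only; needs CAP_NET_RAW (or root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                       socket.htons(ETH_P_ARP)) as sock:
        sock.bind((ifname, 0))
        sock.send(build_gratuitous_arp(mac, ip))
```

Calling something like `send_gratuitous_arp("eth0", bytes.fromhex("020000000001"), "192.0.2.10")` straight after assigning the address would be the equivalent of the third step above. Given that we tolerate a lost datagram or two, sending the frame more than once is a cheap hedge against the ARP itself being dropped.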