Reliable and performant socket layers

January 1, 0001

Some notes from building a socket layer for network appliances (in this case a SIP load balancer). Five nines and all that. Should generalise. I've been meaning to write these thoughts down for a while…

As always, most of this can be derived by thinking a bunch, having a clear set of priorities, and knowing your budgets and what's minimally acceptable.

Key concerns

Packet throughput.
Robust operation during overload (don't crash, no OOMing).
- Most things can fail in some way, so a strategy is needed to deal with them.
Fast failover, and fast recovery (for UDP and TCP there are no excuses; TLS handshakes are expensive enough that they can limit recovery time, depending on hardware).
Concurrency and asynchronicity.
Priorities and failure detection.
Predictability.

Packet throughput

Mostly this is a case of not doing stupid (slow) things. Keep everything non-blocking, size the thread pool sensibly for the cores and tasks available, and use queues to pass buffers between threads. Use an asynchronous polling system like epoll.
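
A minimal sketch of the non-blocking read loop, in Python for brevity rather than speed. The selectors module picks epoll on Linux. The names make_udp_socket, drain and poll_once are made up for illustration, and the per-wakeup packet cap is an assumption to stop one busy socket hogging the thread. Python reports "would block" as an exception, so that is caught right at the socket call.

```python
import selectors
import socket

def make_udp_socket(host="127.0.0.1", port=0):
    """Create a bound, non-blocking UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    sock.bind((host, port))
    return sock

def drain(sock, buf, handle, max_packets=64):
    """Read up to max_packets datagrams without blocking; return the count."""
    count = 0
    while count < max_packets:
        try:
            nbytes = sock.recv_into(buf)  # reuse one buffer, no allocation
        except BlockingIOError:
            break  # nothing left to read right now
        # The view is only valid until the next read into buf.
        handle(memoryview(buf)[:nbytes])
        count += 1
    return count

def poll_once(selector, buf, handle, timeout=0.1):
    """Wait for readable sockets and drain each one; return packets handled."""
    handled = 0
    for key, _events in selector.select(timeout):
        handled += drain(key.fileobj, buf, handle)
    return handled
```

Register sockets with selector.register(sock, selectors.EVENT_READ) and call poll_once in the thread's main loop.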

Use the cache reasonably well.

Robustness

Memory usage

Bound queues to prevent overload and head-of-line blocking from driving memory usage into the OOM killer. OOMing is dumb. You don't have unlimited memory, so stop pretending. Even if memory is bounded because you're using a buffer pool (see below), bounding queues is still worthwhile to stop a single queue from consuming all the buffers.
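
Roughly what a bounded queue with fast rejection looks like. BoundedQueue is a hypothetical name, and this is a single-threaded sketch: passing items between threads would need a lock or a proper single-producer-single-consumer ring. The point is only that a full queue says no immediately instead of growing.

```python
from collections import deque

class BoundedQueue:
    """FIFO that rejects new items when full instead of growing."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self.rejected = 0  # worth exporting as a metric

    def try_push(self, item):
        """Return True if queued, False if the queue is full."""
        if len(self._items) >= self._capacity:
            self.rejected += 1
            return False
        self._items.append(item)
        return True

    def try_pop(self):
        """Return the oldest item, or None if the queue is empty."""
        if not self._items:
            return None
        return self._items.popleft()
```

The caller decides what a False return means: drop the packet, send a cheap rejection, or slow the reader down.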

Preallocated buffer pools keep memory allocation bounded. Memory allocation is slow anyway. Buffers running out is a good signal to start graceful and fast rejection of new requests (preferable to crashing). Make sure you have buffers for reading from sockets and sending error messages (particularly important for trunking connections).
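
One way to keep buffers back for socket reads and error replies is a reserve that ordinary work cannot dip into. A sketch, assuming a single owning thread; BufferPool, reserve and the critical flag are names invented here.

```python
class BufferPool:
    """Fixed set of preallocated buffers with a reserve for critical work."""

    def __init__(self, count, size, reserve):
        # All memory is allocated up front; nothing is allocated later.
        self._free = [bytearray(size) for _ in range(count)]
        self._reserve = reserve

    def acquire(self, critical=False):
        """Return a buffer, or None when the caller's share is exhausted.

        Ordinary callers cannot take the last `reserve` buffers; critical
        callers (socket reads, error replies) can.
        """
        floor = 0 if critical else self._reserve
        if len(self._free) <= floor:
            return None
        return self._free.pop()

    def release(self, buf):
        """Return a buffer to the pool."""
        self._free.append(buf)

    def available(self):
        return len(self._free)
```

A None from acquire() is the signal to start rejecting new requests while the reserve still lets you say so on the wire.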

Other components on your system may be eating a substantial proportion of available memory.

Failure detection

Under load, healthcheck probes can get lost in the flood. Schedule handling these separately (different threads and sockets that can be prioritized over others). Without this, applications needing fast (e.g. <2 s) failure response can end up flip-flopping between health states because probes get lost in the noise. Note that TLS handshakes are obviously capable of starving threads, and so are TCP connection setup/failure storms.
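
A cut-down version of the idea: give probes their own socket and serve it first on every wakeup. This is a single-thread simplification of the separate-threads approach described above, not a replacement for it. PROBE, TRAFFIC and poll_prioritized are hypothetical names; the priority is stored in the data field when the socket is registered.

```python
import selectors

PROBE, TRAFFIC = 0, 1  # lower value is served first

def poll_prioritized(selector, handlers, timeout=0.1):
    """Handle ready sockets in priority order; return the order served.

    Each socket is registered with its priority as `data`, and `handlers`
    maps a priority to a callable that takes the ready socket.
    """
    events = selector.select(timeout)
    # Serve probe sockets before anything else that became ready.
    events.sort(key=lambda pair: pair[0].data)
    order = []
    for key, _mask in events:
        handlers[key.data](key.fileobj)
        order.append(key.data)
    return order
```

Register the probe socket with selector.register(sock, selectors.EVENT_READ, PROBE) and everything else with TRAFFIC.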

Modelling entities

Note that endpoints are not the same as connections. Protocols like SIP expect UACs to try to reconnect when a connection is dropped, without dropping state. A single connection may carry multiple streams (see also TCP head-of-line blocking and associated pain).
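
In data-model terms this might look something like the following, where the endpoint owns the long-lived state and the connection is just its current transport. Endpoint, Connection and their fields are illustrative names, not taken from any real SIP stack.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Connection:
    """One transport-level connection; may carry several streams."""
    fd: int
    streams: set = field(default_factory=set)

@dataclass
class Endpoint:
    """A peer. Its state outlives any single connection to it."""
    address: tuple
    sessions: dict = field(default_factory=dict)
    connection: Optional[Connection] = None

    def on_connect(self, conn):
        self.connection = conn

    def on_disconnect(self):
        # Drop the transport only; sessions stay so the peer can reconnect.
        self.connection = None
```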

Watchdogs

Healthcheck everything critical. Have something send packets through your system to check it is alive, and respond quickly to failures.

Who watches the watchdogs? Watchdogs get simpler as they go up the supervision tree.

If using systemd, disable all back-off timers and prioritize restarting the application.

Kubernetes back-off timers will eat your annual outage budget in one sitting, so if you use it, pray failures don't spread across other instances (or perhaps don't use Kubernetes).

Note also that core dumps can increase recovery time. In extreme times it may be worth temporarily disabling them.

Metrics and alerting, of course.

Error handling

Don't use exceptions. Exceptions are dumb, exceptions are slow. In particular, errors on sockets are expected and not exceptional, and firing exceptions can cause starvation on failover (firing an exception on a failed TCP connection can easily disrupt other more important work). Explicit error codes are much cleaner, simpler, and orders of magnitude more performant.
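
To show the shape of the error-code style: Python's socket.connect_ex returns the errno from connect() instead of raising, so a failed connect becomes an ordinary value to branch on. classify, start_connect and the three outcome constants are made-up names, and the errno grouping is an assumption about what counts as "still in progress". Note that connect_ex can still raise for problems outside the connect() call itself, such as name resolution.

```python
import errno
import socket

OK, RETRY, DROP = "ok", "retry", "drop"

# Codes that mean "not finished yet" on a non-blocking socket.
_IN_PROGRESS = (errno.EINPROGRESS, errno.EALREADY,
                errno.EAGAIN, errno.EWOULDBLOCK)

def classify(err):
    """Map an errno value to what the caller should do next."""
    if err == 0:
        return OK
    if err in _IN_PROGRESS:
        return RETRY  # wait for writability, then check again
    return DROP  # refused, unreachable, reset, ...: expected, not exceptional

def start_connect(sock, address):
    """Begin a non-blocking connect and return OK, RETRY or DROP."""
    sock.setblocking(False)
    return classify(sock.connect_ex(address))
```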

Concurrency and asynchronicity

Threading models

Object models are dumb, but if you must, thread ownership dominates object ownership ALWAYS (and hence object models are dumb).

Fast lock-free (or wait-free) queues aren't linearizable (they're essentially bundles of single-producer-single-consumer queues), so don't reason temporally about ordering across producers.

Life is easier if you keep reading and writing to a socket on the same thread, and shard your sockets across threads. This isn't strictly necessary, but if you split reads and writes across threads, lifetimes will need careful management.
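
A sketch of the sharding part, assuming endpoints are (host, port) pairs. owner_thread is a hypothetical helper; crc32 is used only because it is stable across runs, unlike Python's built-in string hash.

```python
import zlib

def owner_thread(endpoint, n_threads):
    """Map a (host, port) endpoint to the index of the thread that owns it.

    The same endpoint always lands on the same thread, so every read and
    write for its socket happens in one place.
    """
    key = f"{endpoint[0]}:{endpoint[1]}".encode()
    return zlib.crc32(key) % n_threads
```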

Make sure file descriptors are cleaned up. Conceptually, the operating system is like another thread.

Have an ownership model you can reason about. If you must lock, have a locking model you can reason about.

Don't spawn threads willy-nilly.

TLS pain

Note that OpenSSL uses thread-local state for TLS sockets, so you need to initialize the socket on the thread you want to run it on.

OpenSSL also keeps its error queue per thread rather than per socket, so make sure you clear it (ERR_clear_error()) after every possibly erroring operation. Not doing this can cause a single error on a single socket to starve proper operation of every TLS socket sharing that thread. You won't see the error on the socket that caused the problem, because you didn't check for errors (which would have cleared the queue).

TLS handshakes are expensive, and the connection setup rate is bounded by compute. Measure and provision threads accordingly. Measure across different key sizes.

Back-pressure and recovery

Asynchronicity is great and all, but as soon as you get some blocking on a socket (e.g. your TCP connection is backing up), queues can start filling. Have a strategy to make sure single slow endpoints don't end up holding all the buffers (a simple way may be bounding the number of buffers queued on a single endpoint).
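
The per-endpoint bound could be as simple as this. EndpointQueues and per_endpoint_limit are invented names, and the sketch assumes one owning thread.

```python
from collections import defaultdict, deque

class EndpointQueues:
    """Outbound queues with a cap on buffers held per endpoint."""

    def __init__(self, per_endpoint_limit):
        self._limit = per_endpoint_limit
        self._queues = defaultdict(deque)

    def try_enqueue(self, endpoint, buf):
        """Queue buf for endpoint; return False if that endpoint is at its cap."""
        queue = self._queues[endpoint]
        if len(queue) >= self._limit:
            return False  # slow endpoint: push back instead of hoarding
        queue.append(buf)
        return True

    def next_for(self, endpoint):
        """Return the next buffer to send to endpoint, or None."""
        queue = self._queues.get(endpoint)
        if not queue:
            return None
        return queue.popleft()
```

A backed-up endpoint hits its own cap and gets rejections, while everyone else keeps flowing.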

Notes on protocols

UDP

UDP is much easier in many ways. SIP has per-session reliability mechanisms built on top of UDP. This is much better than TCP's.

TCP

Heavy TCP connections easily run into problems with head-of-line blocking even with relatively low packet drop.

Fiddling with kernel settings can be helpful, but won't overcome problems in design.

TLS

Handshakes are computationally expensive.

Testing

Test it all, obviously, and stress it over long periods of time. Make testing easy. Make it easy to create a fresh test environment (you will break things).