Observing and Understanding Accept Queues in Linux

All complex systems can, and in my opinion should, be modelled and reasoned about as a collection of queues, caches, connections, and workers. In Linux, this type of abstract modeling is not only just a good idea, it reflects quite literally how the kernel is built.

R e q u e s t s q u e I u n e b o h u F e n I r d F e O C o n Q n u e e c u t e i o n s l i s t e n ( ) a a a c c c c c c e e e p p p t t t ( ( ( ) ) ) S m W s c o s e o o u h n e t r r b a r e k j n d v o e e g e i f w r c e s c o t i e 1 r s b g . k t t a n o e r o s r r a e o s t d f . e g y

The Linux kernel is composed of thousands of interdependent code paths that are capable of producing millions of events per second. These code paths heavily leverage queue and stack mechanics in order to keep the kernel operating smoothly.

// include/net/inet_connection_sock.h v6.2

/** inet_connection_sock - INET connection oriented sock
 *
 * @icsk_accept_queue:	   FIFO of established children
 * @icsk_bind_hash:	   Bind node
 * @icsk_bind2_hash:	   Bind node in the bhash2 table
 * @icsk_timeout:	   Timeout
 * @icsk_retransmit_timer: Resend (no ack)
 * @icsk_rto:		   Retransmit timeout
 * @icsk_pmtu_cookie	   Last pmtu seen by socket
 * @icsk_ca_ops		   Pluggable congestion control hook
 * @icsk_af_ops		   Operations which are AF_INET{4,6} specific
 * @icsk_ulp_ops	   Pluggable ULP control hook
 * @icsk_ulp_data	   ULP private data
 * @icsk_clean_acked	   Clean acked data hook
 * @icsk_ca_state:	   Congestion control state
 * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
 * @icsk_pending:	   Scheduled timer event
 * @icsk_backoff:	   Backoff
 * @icsk_syn_retries:      Number of allowed SYN (or equivalent) retries
 * @icsk_probes_out:	   unanswered 0 window probes
 * @icsk_ext_hdr_len:	   Network protocol overhead (IP/IPv6 options)
 * @icsk_ack:		   Delayed ACK control data
 * @icsk_mtup;		   MTU probing control data
 * @icsk_probes_tstamp:    Probe timestamp (cleared by non-zero window ack)
 * @icsk_user_timeout:	   TCP_USER_TIMEOUT value
 */
struct inet_connection_sock {
	/* inet_sock has to be the first member! */
	struct inet_sock	  icsk_inet;
	struct request_sock_queue icsk_accept_queue;
	struct inet_bind_bucket	  *icsk_bind_hash;
	struct inet_bind2_bucket  *icsk_bind2_hash;
	unsigned long		  icsk_timeout;
 	struct timer_list	  icsk_retransmit_timer;
 	struct timer_list	  icsk_delack_timer;
	__u32			  icsk_rto;
	__u32                     icsk_rto_min;
	__u32                     icsk_delack_max;
	__u32			  icsk_pmtu_cookie;
	const struct tcp_congestion_ops *icsk_ca_ops;
	const struct inet_connection_sock_af_ops *icsk_af_ops;
	const struct tcp_ulp_ops  *icsk_ulp_ops;
	void __rcu		  *icsk_ulp_data;
	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
	__u8			  icsk_ca_state:5,
				  icsk_ca_initialized:1,
				  icsk_ca_setsockopt:1,
				  icsk_ca_dst_locked:1;
	__u8			  icsk_retransmits;
	__u8			  icsk_pending;
	__u8			  icsk_backoff;
	__u8			  icsk_syn_retries;
	__u8			  icsk_probes_out;
	__u16			  icsk_ext_hdr_len;
	struct {
		__u8		  pending;	 /* ACK is pending			   */
		__u8		  quick;	 /* Scheduled number of quick acks	   */
		__u8		  pingpong;	 /* The session is interactive		   */
		__u8		  retry;	 /* Number of attempts			   */
		__u32		  ato;		 /* Predicted tick of soft clock	   */
		unsigned long	  timeout;	 /* Currently scheduled timeout		   */
		__u32		  lrcvtime;	 /* timestamp of last received data packet */
		__u16		  last_seg_size; /* Size of last incoming segment	   */
		__u16		  rcv_mss;	 /* MSS used for delayed ACK decisions	   */
	}icsk_ack; //...
}

In Linux, all inbound network requests from an arbitrary client will pass through the kernel backlog queue which is an instance of a request_sock_queue struct and part of the normal TCP handshake. This is true for any socket server (TCP/IPv4, TCP/IPv6, Unix domain, UDP/connectionless) built using the Linux network stack or the /include/net directory in the source tree.

Additionally, there is a more mature and much more exciting structure which is assembled after the TCP handshake is complete which is the complete socket structure.

 // include/linux/net.h v6.2

/**
 *  struct socket - general BSD socket
 *  @state: socket state (%SS_CONNECTED, etc)
 *  @type: socket type (%SOCK_STREAM, etc)
 *  @flags: socket flags (%SOCK_NOSPACE, etc)
 *  @ops: protocol specific socket operations
 *  @file: File back pointer for gc
 *  @sk: internal networking protocol agnostic socket representation
 *  @wq: wait queue for several uses
 */
struct socket {
	socket_state		state;

	short			type;

	unsigned long		flags;

	struct file		*file;
	struct sock		*sk;
	const struct proto_ops	*ops;

	struct socket_wq	wq;
};

Herein lies “The Accept Queue” which can be observed after a request has completed the TCP handshake and exited the “backlog” or “SYN queue”.

Inbound requests may accumulate at runtime which exist in between the moment a server has completed the TCP handshake, and the moment a worker has called accept() to pop the connection pointer off the stack.

As these requests begin to queue, problems arise such as slow user experience or wasted compute resources due to saturated services.

The kernel accept queue is a trivial FIFO queue implementation, with some nuance surrounding TFO or TCP Fast Open which speeds up TCP while also establishing SYN cookies. TFO was originally presented by Google in 2011 TCP Fast Open 2011 PDF and is now the default implementation for opening sockets in the kernel.

If the network stack receives requests at a faster rate than the workers can process the requests, the accept queue grows.

In the follow model, a worker is any arbitrary service that communicates with the networking stack using accept(2) which can be used after a service has called listen(3) to begin accepting inbound connections.

C o n n e c t l i i o s n t s e n ( ) Q u e u e a a a c c c c c c e e e p p p t t t ( ( ( ) ) ) W W W W o o o o r r r r k k k k e e e e r r r r s 1 2 3

For example a unicast service accepting inbound TCP connections in Go which references the system call functions directly, as the Go programming language does not use in a libc implementation such as glibc. Notice how the server first calls net.Listen() and later calls l.Accept() passing each connection off to a new goroutine.

func main() {
	// Source: https://pkg.go.dev/net#example-Listener
    
	// Listen on TCP port 2000 on all available unicast and
	// anycast IP addresses of the local system.
	l, err := net.Listen("tcp", ":2000")
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()
	for {
		// Wait for a connection.
		conn, err := l.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// Handle the connection in a new goroutine.
		// The loop then returns to accepting, so that
		// multiple connections may be served concurrently.
		go func(c net.Conn) {
			// Echo all incoming data.
			io.Copy(c, c)
			// Shut down the connection.
			c.Close()
		}(conn)
	}
}

Different servers will have different strategies for removing inbound requests from the accept queue for processing based on the implementation detail of the server. For example the Apache HTTP Server notably will hand requests off to a worker thread, while NGINX is event based and workers will process events based as they come in and workers are available for processing.

Note: The Apache server is often claimed to “spawn a thread per request”, which is not necessarily an accurate claim. Apache calls out MaxConnectionsPerChild which only would spawn a “thread per request” if set to a value of 1.

One of the primary drivers for NGINX’s event based worker strategy is the need to process more throughput at runtime using a reasonable amount of resources. NGINX’s design is intended to introduce nonlinear scalability in terms of connects and requests per second. NGINX accomplishes this by reducing the amount of overhead it takes to process a request from the queue. NGINX uses event based architecture and strong concurrency patterns to ensure that workers are ready to call accept() and handle a request as efficiently as possible.

NGINX Reverse Proxy

Recently I performed a small amount of analysis on NGINX reverse proxy servers which was able to demonstrate the behavior of NGINX given known dysfunctional upstream servers in which I was able to calculate the “Active Connections” metric produced by the popular stub status module as:

Field Description
Q Number of items in Linux accept queue.
A Number of active connections currently being processed by NGINX.
1 The GET request used to query the stub status module itself.
somaxconn Arbitrary limit for accept queues either set by a user, or default to 1024

Note that NGINX operates with a monolithic listen() statement in the master process, and that accept() events are operated on by worker threads as events are produced within NGINX.

l a a i c c s c c t e e e p p n t t ( ( ( ) ) ) ; ; ; B a c N k g l i o n g x N N Q M g g u a i i e s n n u Q t x x e e r W W o o P r r s r k k o o e e m c r r a e x s T T c s h h o r r n e e n a a d d 1 2 A c c e p A t c e t d i v C e o n C n A o e n c n t e i c o t n i s o n s

$$ Active Connections = \sum_{Q + A + 1} $$

There were 2 key takeaways from my work that are relevant in this discussion. Specifically on how NGINX is able to set an upper limit on the kernel accept queues described above.

  1. NGINX manages an internal accept queue limit known also known as “the backlog queue” which the implementation can be seen in /src/core/ngx_connection.c and defaults to 511 + 1 on Linux. Also see Tuning NGINX.
  2. The NGINX backlog queue can be set to an arbitrary values by modifying kernel parameters using sysctl(8); net.core.somaxconn and net.core.netdev_max_backlog.

It is important to note that even despite raising the upper limit of the accept queue using sysctl(8), the NGINX worker_connections directive can still impose an upper limit on connections to the server at large even if there is plenty of available room in the accept queue buffers.

Regardless of which limit (accept queue, backlog queue, or worker connections) was exceeded, I was able to demonstrate NGINX returning 5XX level HTTP responses simply by setting the various limits low enough and exceeding the limits with curl requests in a simple bash loop.

Performance

Despite this analysis being exciting from a conceptual perspective to any engineer hoping to operate a web server without finding their service vulnerable to a denial of service attack. The implications on the accept queue and performance are even more exciting to understand.

On a discreet compute system, the longer an inbound request sits in an accept queue with idle system resources, the less performant your server implementation is. In other words the more accept queuing that can be observed without simultaneously correlating CPU utilization also at capacity, the more time your computers are sitting around doing nothing when they could otherwise be processing traffic.

Performance engineers understand these key points as Utilization and Saturation. Utilization is the measurement of how well utilized your system resources are compared to load. Saturation is the point in which you have received more load than your current services can process and queuing can be observed.

Observing Accept Queues

Now – the question remains how does one observe the state of these queues? More importantly: when would you want to?

In order to observe the queues you will first want to understand which specific accept queues you believe to be interesting on your servers.

Specifically there are 4 types of connections that most enterprise services will find interesting:

  • TCP/IPv4
  • TCP/IPv6
  • Unix domain
  • UDP (connectionless)

In this example we are interested in observing the moment a TCP/IPV4 connection is appended to an accept queue, as well as the moment a connection is removed from an accept queue. Demonstrating the queue accumulating connections behavior alone requires a special environment where a server is listening for connections, but will not accept the connections. Then a simple client tool such a curl can be used to connect to the dysfunctional server.

To do this we need to instrument the functions of the kernel where specific events occur using Extended BPF eBPF and a special utility known as kprobes.

For TCP/IPv4 and TCP/IPv6 instrumentation on a 6.2 kernel here is what I use.

Kernel Function Source Description Specific Field Layer
tcp_conn_request inet_connection_sock.c queue++ *sock.sk_ack_backlog 4/TCP
inet_csk_accept tcp_input.c queue– sock.sk_ack_backlog 3/IP
inet_csk_listen_start inet_connection_sock.c N/A Informative 3/IP

In my opinion it is important to measuring queues when an element is added, as well as removed from the queue. This ensures accurate reporting to the total lifecycle of a given accumulation period such that an element can not be added or removed silently.

This methodology of measuring when an element is added, and removed is important because kprobes are executed inline with existing kernel functions. In other words, the only way to surfacing the values out of the kernel with kprobes is for something to actually exercise the code that adds and removes elements from the queue.

Tools such as Python BCC make this exercise fairly trivial.

# observe.py
from bcc import BPF
BPF(text='int kprobe__tcp_conn_request(struct request_sock_ops *rsk_ops, const struct tcp_request_sock_ops *af_ops, struct sock *sk, struct sk_buff *skb) { bpf_trace_printk("qlen: *sock.sk_ack_backlog"); return 0; }').trace_print()

Additionally the ss(8) command makes extremely quick work of this exercise.

# ss -lnt
State  Recv-Q Send-Q      Local Address:Port  Peer Address:PortProcess
LISTEN 17      4096              0.0.0.0:80         0.0.0.0:*

I wrote a working example of an eBPF kprobe implementation in Rust as I intend on adding more advanced metrics in the future that shows the detail and some more of my research in a project called q. The q project contains a directory /servers which houses a set of dysfunctional servers written in C that can be used to simulate the metrics.

Regardless of where you are surfacing the data, there will be trade-offs and limitations to what specific metrics you are interested in. The kernel networking stack isn’t as complicated as you might think as soon as you are sufficiently above net device (tcpdump, wireshark, network devices, etc).

A quick brush up on the relationship between Linux system calls and the TCP handshake makes quick work of understanding the relationship between listen() and accept().

TCP is a stateful protocol, and the connections must exist somewhere while we wait for SYN,ACK from the server.

In our case, this place is the Linux backlog queue which can be a pain to learn about the hard way in the event your servers are no longer accepting new connections.

S e p F t V W V I e F A N \ m S R s I I r W b Y C C n N T c A r s e N V L d - v I c n r D O 1 T v d S F A - 1 E I C 2 F A 9 N K I C 8 x N K 1 o f r c F v I s N p r n A a c c d C r s r v K c s e S x s r s v T i a S Y o C n c n C v t Y N f L d v d A P e e N , O C A S S F F A K C O T C Y E I I C x o P C K N N N K o n E B f n N e F T c F V r s V V C I I t i C L c n E L N M i g L I v d S O E o u O S T S V n r S T S A A I W e E E Y C B N A S D N N K V G I t 6 T a . d T t e i d e C l r m e T L e s c r s e l D r O t S n v s c n o e i a S e E d n v d u t a n E N S d t e g s T D S Y F A = r m F C Y N A I C 2 T a i u B N , C N K M C m s n A K S B s c d C L i t e K o i C l r n o L e c n O t v C a S e o l E A n a T C t S c c C K r p t r s B s x o e i e n C n o l c v a d L d L f C i e t C O A L P f e S S S L W S F S F O r i O Y Y E O A E I T I S o c P T N N N S I N - N E t a E C T E T A D o t N B V V C V c i K o o l n

More Resources