Fix taskmaster-max-accept-count-related behavior
Several issues had to be considered to end up with the correct behavior:

1. Threading behavior

taskmaster-max-accept-count was not respected because handle-incoming-connection was executed in the main acceptor thread. As soon as taskmaster-request-count reached taskmaster-max-thread-count, that thread was put to sleep. New connections were therefore no longer accepted, eventually reaching the acceptor listen backlog limit and being silently dropped by the kernel (instead of being answered with an HTTP 503).

FIX: The handler thread is now spawned earlier, so the main acceptor thread is never blocked.

2. Counter behavior

Moreover, even with the thread spawned earlier, the behavior was not valid: in handle-incoming-connection, when taskmaster-request-count reached taskmaster-max-thread-count, the current handling thread was put to sleep without incrementing taskmaster-request-count first. Thus taskmaster-request-count would never reach taskmaster-max-accept-count and would never trigger the too-many-taskmaster-requests behavior (which sends an HTTP 503 answer). The acceptor would therefore keep accepting new connections until eventually exhausting the number of file descriptors (fds) allowed for the process (given a heavy enough load).

2.1 SUGGESTED FIX: incrementing taskmaster-request-count earlier

This solution is not acceptable. Although it would actually fix the fd exhaustion and adequately trigger the too-many-taskmaster-requests behavior, we would expose ourselves to the following scenario.

Let's take taskmaster-max-thread-count = 100 and taskmaster-max-accept-count = 120 (the default values). Let's assume 140 requests are received by the server quickly enough for the last request to be accepted before the first one has finished processing, but not quickly enough for the number of not-yet-accepted connections to reach the acceptor listen backlog (iow, all requests are handed to handle-incoming-connection more or less simultaneously).

The first 100 requests start being processed right away (taskmaster-request-count < taskmaster-max-thread-count), the next 20 requests are put into a waiting state (taskmaster-max-thread-count <= taskmaster-request-count < taskmaster-max-accept-count) and the last 20 requests are rejected with an HTTP 503 answer (taskmaster-request-count >= taskmaster-max-accept-count). taskmaster-request-count now has the value 120.

Once the first 10 requests have finished processing, it drops to 110 (with 20 threads still in a waiting state and 90 requests being processed). Although note-free-connection has been called (10 times), the condition checked in wait-for-free-connection (taskmaster-request-count < taskmaster-max-thread-count) is not satisfied and no thread is woken up, until 11 more requests finish processing and taskmaster-request-count finally reaches 99, at which point one thread is woken up. For a grand total of 80 requests being processed while 20 others were still sleeping (!). The waiting requests are therefore put to sleep for much longer than necessary.

Even worse, if taskmaster-max-accept-count > 2*taskmaster-max-thread-count, there are always too many sleeping threads for taskmaster-request-count to ever drop below taskmaster-max-thread-count, and hunchentoot eventually rejects every new request without ever processing the old ones still waiting.
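To make the arithmetic of this scenario concrete, here is a small Common Lisp toy model of the single-counter behavior. It only tracks the counter values, not real threads or sockets; the variable names are illustrative and this is not hunchentoot code:

    ;; Toy model of the "increment taskmaster-request-count earlier" idea.
    (let ((max-thread-count 100)
          (max-accept-count 120)
          (request-count 0)
          (processing 0)
          (waiting 0))
      ;; 140 near-simultaneous requests; the counter is incremented before
      ;; any check, as the suggested fix would do.
      (dotimes (i 140)
        (declare (ignorable i))
        (incf request-count)
        (cond ((<= request-count max-thread-count) (incf processing))
              ((<= request-count max-accept-count) (incf waiting))
              (t (decf request-count))))   ; rejected with 503, increment undone
      ;; Now let 21 of the 100 running requests finish; the wake-up condition
      ;; (request-count < max-thread-count) only becomes true on the 21st one.
      (dotimes (done 21)
        (decf processing)
        (decf request-count)
        (when (< request-count max-thread-count)
          (format t "first waiting thread woken after ~D completions~%"
                  (1+ done))))
      ;; The woken thread is not moved back to "processing" in this toy model.
      (format t "processing: ~D, still waiting: ~D, request-count: ~D~%"
              processing waiting request-count))

Running it shows the first waiting thread being woken only after the 21st completion, with request-count stuck at or above max-thread-count the whole time.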
2.2 FIX: using two separate counters, one for accepted connections and one for processed connections

We now use two different counters. While taskmaster-request-count still counts the number of requests being processed (and its increment/decrement strategy is unchanged), we created another counter, taskmaster-accept-count, which counts the number of requests that have currently been accepted (iow, the REAL number of threads currently spawned, whether processing a request or in a waiting state). This counter is incremented by handle-incoming-connection% as soon as we fork from the acceptor thread, and is decremented before the end of the thread, either in send-service-unavailable-reply or when taskmaster-request-count is decremented. (A condensed sketch of the resulting control flow is included at the end of this message.)

3. Acceptor listen backlog race condition

Of course, if the load is so heavy that hunchentoot receives requests faster than it can send HTTP 503 answers, the acceptor listen backlog will eventually be reached and connections will be silently closed. Fixing this would mean setting an extremely high listen backlog (or disabling it altogether). However, that would delay the moment at which hunchentoot falls back to a stable state after a heavy load spike, because it would have to send an HTTP 503 answer for each and every connection still waiting to be accepted. Such a "fix" is therefore perceived as increasing the probability of hunchentoot failing to process queries in case of consecutive spikes and as slowing down the recovery process. The current behavior is thus deemed acceptable, if not desirable, in order to guarantee better reliability under consecutive spikes and heavy load.
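For reference, the resulting control flow (fixes 1 and 2.2 combined) can be summarized by the following condensed sketch. This is NOT the actual hunchentoot source: locking and error handling are omitted, the taskmaster defstruct merely stands in for hunchentoot's taskmaster class, and spawn-handler-thread / process-request are placeholder helpers. In the real code the accept counter is decremented inside send-service-unavailable-reply rather than by the caller.

    ;; Toy stand-in for hunchentoot's taskmaster; the accessor names mirror
    ;; the ones discussed above, but this is not the real class.
    (defstruct taskmaster
      (max-thread-count 100)
      (max-accept-count 120)
      (request-count 0)
      (accept-count 0))

    (defun handle-incoming-connection (taskmaster socket)
      ;; Fix 1: fork right away so the main acceptor thread is never blocked
      ;; and keeps draining the listen backlog.
      (spawn-handler-thread            ; placeholder helper
       (lambda () (handle-incoming-connection% taskmaster socket))))

    (defun handle-incoming-connection% (taskmaster socket)
      ;; Fix 2.2: taskmaster-accept-count counts every spawned handler thread,
      ;; whether it is processing a request or waiting for a free slot.
      (incf (taskmaster-accept-count taskmaster))
      (if (> (taskmaster-accept-count taskmaster)
             (taskmaster-max-accept-count taskmaster))
          (progn
            ;; Too many accepted connections: answer with an HTTP 503.
            (send-service-unavailable-reply taskmaster socket)
            ;; In the actual commit this decrement happens inside
            ;; send-service-unavailable-reply.
            (decf (taskmaster-accept-count taskmaster)))
          (progn
            ;; All worker slots busy: sleep until one is free.
            (when (>= (taskmaster-request-count taskmaster)
                      (taskmaster-max-thread-count taskmaster))
              (wait-for-free-connection taskmaster))
            ;; taskmaster-request-count keeps counting requests being processed.
            (incf (taskmaster-request-count taskmaster))
            (unwind-protect
                 (process-request taskmaster socket)   ; placeholder helper
              (decf (taskmaster-request-count taskmaster))
              (decf (taskmaster-accept-count taskmaster))
              (note-free-connection taskmaster)))))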