When I give system design interviews, candidates who reflexively start adding queues to the design always do poorly.
Queueing is only useful for a few cases, IMO:
* The request is expensive to reject. For example, the inputs to the request came from other expensive requests or operations (like a file upload), so rejecting it under load multiplies the load on other parts of the system. You still need backpressure or forward pressure (autoscaling).
* Losing a request is expensive, delaying the result is not. Usually you want a suitably configured durable queueing system (e.g. Kafka) if you have this scenario.
* A very short queue is acceptable if it's necessary that downstream resources are kept 100% busy. A good example of this is in a router, the output to a slower link might queue 1-2 packets so that there is always something to send, which maximizes throughput.
* If you have very bursty traffic, you can smooth the bursts to fit in your capacity. But this runs the danger of having the queue always full, which you have to manage with load shedding (either automated or manual).
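A minimal sketch of the burst-smoothing case with load shedding, assuming a single in-process buffer (the capacity of 4 and the `submit` helper are illustrative, not from the original): a bounded buffer absorbs short bursts, and requests that arrive while it is full are shed explicitly instead of waiting forever.

```python
import queue

# Bounded buffer sized for expected bursts (capacity is illustrative).
buf = queue.Queue(maxsize=4)

def submit(request):
    """Enqueue a request, shedding it if the buffer is full."""
    try:
        buf.put_nowait(request)
        return True           # accepted: a worker will drain it later
    except queue.Full:
        return False          # shed: caller gets an explicit rejection

# A burst of 6 requests against a capacity of 4: the overflow is shed.
accepted = [submit(i) for i in range(6)]
print(accepted)  # first 4 accepted, last 2 shed
```

The point is that shedding is a visible, deliberate decision at enqueue time, rather than an unbounded queue silently absorbing the burst.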
----
An underappreciated queue type is LIFO (last-in, first-out). It sounds unfair, but it keeps the median response time low at the cost of the maximum response time, and it degrades gracefully when full: it either responds quickly or rejects outright, which makes it well suited to bursty traffic.
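A sketch of that behavior using Python's stdlib `LifoQueue` (the capacity of 3 and the request names are illustrative): the newest request is served first, so recent callers get fast responses, and when the stack is full new work fails fast instead of joining a backlog.

```python
import queue

# Bounded LIFO work queue (capacity is an illustrative choice).
stack = queue.LifoQueue(maxsize=3)

for req in ["a", "b", "c"]:
    stack.put_nowait(req)

try:
    stack.put_nowait("d")      # full: reject immediately, don't wait
except queue.Full:
    print("rejected d")

top = stack.get_nowait()       # the most recent surviving request wins
print(top)                     # "c"
```

Note that "a" may sit at the bottom for a long time; LIFO trades tail latency of old requests for fresh responses to new ones.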
That reminds me of this talk[0] by Gil Tene called "How NOT to Measure Latency" at the Strangeloop conference in 2015 (or read this blog post[1] that contains the most important points).
[0] https://www.youtube.com/watch?v=lJ8ydIuPFeU
[1] https://bravenewgeek.com/everything-you-know-about-latency-i...
The author speculates about ways to deal with an overloaded queue.
Kingman's formula says that as you approach 100% utilization, waiting times explode.
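To make the explosion concrete, here is Kingman's G/G/1 approximation evaluated at a few utilization levels (the coefficients of variation and unit service time below are illustrative inputs, not from the original comment):

```python
# Kingman's approximation for mean wait in a G/G/1 queue:
#   E[W] ≈ (rho / (1 - rho)) * ((ca**2 + cs**2) / 2) * service_time
# rho = utilization; ca, cs = coefficients of variation of the
# interarrival and service time distributions. The 1/(1 - rho)
# factor is what makes waits explode near 100% utilization.
def kingman_wait(rho, ca, cs, service_time):
    return (rho / (1.0 - rho)) * ((ca ** 2 + cs ** 2) / 2.0) * service_time

for rho in (0.50, 0.90, 0.99):
    w = kingman_wait(rho, ca=1.0, cs=1.0, service_time=1.0)
    print(f"utilization {rho:.0%}: mean wait ~ {w:.0f}x service time")
```

With these inputs the mean wait goes from 1x the service time at 50% utilization to 9x at 90% and 99x at 99%: going from 90% to 99% busy costs an order of magnitude in latency.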
The correct way to deal with this is bounded queue lengths and backpressure. I.e., don't deal with an overloaded queue; don't allow an overloaded queue in the first place.
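A sketch of backpressure with a bounded queue, assuming a single producer and consumer thread (the capacity of 2 and the sleep are illustrative): the producer's `put()` blocks whenever the queue is full, so pressure propagates upstream instead of the queue growing without limit.

```python
import queue
import threading
import time

# Bounded queue: put() blocks when 2 items are already waiting.
work = queue.Queue(maxsize=2)

def producer():
    for i in range(5):
        work.put(i)            # blocks here whenever the queue is full
        print(f"enqueued {i}")

def consumer():
    while True:
        item = work.get()
        time.sleep(0.01)       # simulate slow downstream work
        work.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
work.join()                    # wait until everything is drained
print("done")
```

In a real service the blocked producer would be, e.g., a request handler that eventually times out and returns an error to its own caller, which is exactly how the pressure reaches the client.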
Which is easy to say. I've been trying to debug an overloaded queue for over a week now. It used to work, until I discovered some serious race conditions causing one-in-a-million crashes, and every fix for them so far hasn't restored performance. At least I can detect the overload, and I'm allowed to toss things from the queue. But we were handling this load before I put the fixes in, and people don't like it when I now reject things from the queue, so they want the old performance back without the races.
I feel you may be adding your critical sections at too high a layer (in either the code or the data structure) if they are severely affecting performance. Look up sharded locks, and impose a total order on them if you must acquire two or more at once.
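A sketch of sharded locks with total ordering, assuming a shared dict protected per-shard (the shard count, hash choice, and `move` helper are illustrative): keys hash to one of N locks, so unrelated keys don't contend, and when two shards must be held at once they are acquired in ascending index order to rule out deadlock.

```python
import contextlib
import threading

N_SHARDS = 16                          # illustrative shard count
locks = [threading.Lock() for _ in range(N_SHARDS)]
data = {}

def shard_of(key):
    return hash(key) % N_SHARDS

def update(key, value):
    with locks[shard_of(key)]:         # only this key's shard is blocked
        data[key] = value

def move(src, dst):
    # Acquire the (1 or 2) involved shards in ascending index order:
    # a fixed global order on locks prevents lock-order deadlocks.
    shards = sorted({shard_of(src), shard_of(dst)})
    with contextlib.ExitStack() as stack:
        for s in shards:
            stack.enter_context(locks[s])
        data[dst] = data.pop(src)

update("x", 1)
move("x", "y")
print(data)                            # {'y': 1}
```

The `sorted({...})` also handles the case where both keys land on the same shard, so the lock is never acquired twice.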
You may also want to implement reader/writer locks if your load has many more reads than writes.
Unfortunately, nobody really teaches these things clearly, and plenty of engineers don't fully understand them either.
Is your queue bounded?
Does it reject entries when service times are too high?
Your debugging effort may become more tractable once the system measures how long workers take to complete each item.
I note you say it used to work while overloaded. I would argue it was probably having hidden problems. Perhaps ask those people what an acceptable service time is, and lock it in by refusing new entries when it is exceeded.
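A sketch of that admission policy, assuming workers report their completion times (the 0.5 s budget and 20-item window are illustrative, and `record_completion`/`admit` are hypothetical helper names): new entries are refused while the recent average service time exceeds the agreed budget.

```python
import collections

BUDGET = 0.5                            # agreed service time, seconds
recent = collections.deque(maxlen=20)   # rolling window of job durations

def record_completion(duration):
    """Workers call this with how long each job took."""
    recent.append(duration)

def admit():
    """Accept new entries only while recent service times fit the budget."""
    if not recent:
        return True
    return sum(recent) / len(recent) <= BUDGET

record_completion(0.2)
ok_before = admit()     # True: average 0.2 s is within the 0.5 s budget
record_completion(2.0)
ok_after = admit()      # False: average 1.1 s exceeds the budget
print(ok_before, ok_after)
```

Rejections then happen at the edge, where the caller can retry or degrade, instead of deep in a queue that has already blown its deadline.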
If they want both an unbounded queue and consistently acceptable service times, then you must add enough worker capacity to deliver that.