Anatomy of a Throttler, part 1

Learn about some design considerations for implementing a database throttler.

A throttler is a service or a component that pushes back against an incoming flow of requests in order to ensure the system has the capacity to handle all received requests without being overwhelmed. In this series of posts we illustrate design considerations for a database system throttler, whose purpose is to keep the database system healthy overall. We discuss choice of metrics, granularity, behavior, impact, prioritization, and other topics.

Which requests do you throttle?

There are different approaches to throttling requests in a database. We focus on throttling asynchronous, batch, and massive operations that are not time critical. Examples include ETLs, data imports, online DDL operations, mass purges of data, resharding, and so forth. The throttler pushes back on operations that can span minutes, hours, or days. Other forms of throttling may push back on OLTP production traffic; the discussion here applies equally to both.

By way of illustration, consider a job that needs to import 10 million rows into the database. Instead of attempting to apply all 10 million in one go, the job breaks down the task to much smaller subtasks: it will try to import (write) 100 rows at a time. Before any such import, it will request access from the throttler.

Some throttler implementations are collaborative, meaning they assume clients will respect their instructions. Others act as barriers between the app and the database. Either way, if the throttler indicates that the database is overloaded, the job should hold back for a period of time and then request access again. This process repeats until access is granted. Each subtask should be small enough not to single-handedly tank the database's serving capacity, yet large enough to compensate for the added throttler overhead and to make meaningful progress.
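
To make this concrete, here is a minimal sketch (in Go, purely for illustration) of what the client side of such a job might look like. The checkThrottler and importNextBatch functions, the batch size, and the retry interval are all hypothetical placeholders; a real throttler client would have its own API and backoff policy.

    package main

    import "time"

    // checkThrottler is a hypothetical call asking the throttler whether the
    // database currently has capacity. It returns true when access is granted.
    func checkThrottler() bool {
        // e.g. an HTTP call to the throttler's check endpoint
        return true
    }

    // importNextBatch is a hypothetical call that writes the next small batch
    // (say, 100 rows) and reports whether the job has finished.
    func importNextBatch() (done bool, err error) {
        // actual INSERT logic would go here
        return true, nil
    }

    func runImport() error {
        for {
            // Ask the throttler before every batch of writes.
            if !checkThrottler() {
                // Pushed back: hold off briefly, then ask again.
                time.Sleep(250 * time.Millisecond)
                continue
            }
            done, err := importNextBatch()
            if err != nil || done {
                return err
            }
        }
    }

    func main() {
        if err := runImport(); err != nil {
            panic(err)
        }
    }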

What does the throttler throttle on?

Some generic throttlers only allow a regulated rate of requests, in anticipation that the consuming job will be able to process them at some known, fixed rate. With databases, things are less clear. A database can only handle so many queries at any given point in time, or over some period of time. However, not all queries are created equal. A database's capacity to serve queries depends on the scope of the queries, hot spots or cold spots in the affected data, the state of the page cache, and the overlap (or lack thereof) of data served by different queries, to name a few factors.

We therefore need to determine: when do we consider our database to be "healthy"? How do we know when it is being overwhelmed?

To do this, we look for metrics that define or predict service level objectives (SLOs) for the database. But things are not always so simple. Let's start with a popular metric, widely used as a throttling indicator, and see what makes it special.

Replication lag

Replication is often used in database clusters, especially ones using a primary-secondary (aka leader-follower) architecture. Sometimes the replication is asynchronous, and at other times group-communication based. In these scenarios, replication lag is defined as the time that passes between a write on the primary server and the moment it is applied or made visible on the replica/secondary server.

In the MySQL world, replication lag is probably the single most used throttling indicator, as multiple third-party and community tools use it to push back against long running jobs. This is for good reason: it is easy to measure and it has a clear impact on the product and the business. For example, in the case of a database failover, replication lag affects the time it takes for a replica/standby server to be promoted and made available to receive write requests. Read-after-write can be simplified when replication lag is low, allowing secondary servers to serve some of the read traffic.

We can thus have a business limitation on the acceptable replication lag, which we can use in the throttler: below this lag, allow requests. Above this lag, push back.
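In code, that decision is as simple as it sounds. Here's a minimal sketch, assuming an illustrative 5-second business limit on replication lag (the number is an example, not a recommendation):

    package throttle

    import "time"

    // maxReplicationLag is the business-defined acceptable lag (illustrative value).
    const maxReplicationLag = 5 * time.Second

    // checkReplicationLag returns true (allow requests) when the most recently
    // measured lag is within the limit, and false (push back) otherwise.
    func checkReplicationLag(currentLag time.Duration) bool {
        return currentLag <= maxReplicationLag
    }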

Other database metrics: what's in a metric?

Another common metric in the MySQL world is the value of threads_running. On any given server, this is the number of concurrent, actively executing queries (not to be confused with concurrent open transactions, some of which could be idle in between queries). This metric is frequently seen on database dashboards, and is an indicator of database load.
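Sampling this metric is straightforward. Here's a sketch in Go, assuming a MySQL connection obtained through database/sql (the driver choice is up to you):

    package metrics

    import (
        "database/sql"
        "strconv"
    )

    // threadsRunning reads the current number of actively executing queries,
    // via the server's threads_running status variable.
    func threadsRunning(db *sql.DB) (int, error) {
        var name, value string
        err := db.QueryRow("SHOW GLOBAL STATUS LIKE 'Threads_running'").Scan(&name, &value)
        if err != nil {
            return 0, err
        }
        return strconv.Atoi(value)
    }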

But, what's an acceptable value? Are 50 concurrent queries OK? Are 100 OK? Pick a number, and you'll soon find it doesn't hold water. Some values are acceptable in early morning, while others are just normal during peak traffic hours. As your product evolves and its adoption increases, so do the queries on your database. What was true 3 months ago is not true today. And again, not all queries are created equal.

What's different about this metric compared with replication lag is that it is much more of a symptom than an actual cause. If we suddenly see a sharp spike in active queries, there are several possible causes: perhaps they are all held up in the commit queue, which for some reason stalls. Or the queries happen to compete over a specific hotspot and wait on locks. Or they don't: they all happen to touch very different pages, none of which is in memory, and they all congest while waiting on the page cache, and so on. So what exactly do we need to monitor? Is the metric itself useless?

Not necessarily. An experienced administrator may only need to take one look at this metric on the database dashboard to say "we're having an issue".

A closer examination: queues

Software relies heavily on queues. They are fundamental not only in software design but also in hardware access. Requests queue on network access. They queue on disk access. They queue on CPU access. They queue on locks.

Circling back to replication lag: much like concurrent queries, it is a symptom. For example, disk I/O is saturated on the replica, so the replica cannot keep up replaying the changelog and accumulates lag. Or perhaps the lag is caused by a slow network. Or both! Whatever the case, what's interesting is that the replication mechanism itself is a queue: the changelog event queue. A new write on the primary manifests as a "write event", which is shipped to the replica and waits to be consumed (processed, replayed) by the replica. Replication lag is the event's time spent in the queue, where our queue is a combination of the network queue, the local disk write queue, the actual wait time, and finally the event's processing time. Each of these can be the major contributor to the overall replication lag, and yet we can still look at replication lag as a whole, as a clear indicator of database health.

Armed with this insight, we take another attempt at understanding other metrics. In the case of concurrent queries, we understand that a major contributor to a spike is their inability to complete. Normally, this means they're held back at commit time, i.e. they wait to be written to the transaction (redo) log. And that means they're in the transaction queue, and we can hence measure the transaction queue latency (aka queue delay).

But, what's a good threshold? Transaction commit delay is typically caused by disk write/flush time, and that changes dramatically across hardware. It is a matter of knowing your metrics. Yet again, an experienced administrator should know what values to expect. But now the values are more tightly bound to hardware and slightly less affected by the app.

Queue delay is not the only metric. Another common one is the queue length: the number of entries waiting in the queue. A long queue at the airport isn't in itself a bad thing (some queues move quite fast), and yet it's often a predictor of wait times. Where wait time is impossible or difficult to measure, queue length can be an alternative.

An operating system's load average includes the number of processes waiting for CPU time. Its interpretation changes with the number of CPUs available. A common rough indicator is a threshold of 1 for (load average)/(number of CPUs). This is again a metric you must calibrate against your own systems. Some database deployments famously push their servers to their limits, with load averages soaring far above 1 per CPU.
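As a sketch, here's what that normalized indicator might look like on a Linux host, reading the 1-minute load average from /proc/loadavg. The 1-per-CPU threshold remains a rule of thumb you'd calibrate to your own systems:

    package metrics

    import (
        "fmt"
        "os"
        "runtime"
        "strconv"
        "strings"
    )

    // loadPerCPU returns the 1-minute load average divided by the number of CPUs.
    func loadPerCPU() (float64, error) {
        data, err := os.ReadFile("/proc/loadavg")
        if err != nil {
            return 0, err
        }
        fields := strings.Fields(string(data))
        if len(fields) == 0 {
            return 0, fmt.Errorf("unexpected /proc/loadavg format")
        }
        load1, err := strconv.ParseFloat(fields[0], 64)
        if err != nil {
            return 0, err
        }
        return load1 / float64(runtime.NumCPU()), nil
    }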

Pool usage

Another indicator is pool usage. The single most common pool with regard to databases is probably the application's database connection pool. To run a query, the app takes a connection from the pool, uses it to execute the query, then returns the connection to the pool. If the pool has connections to spare, getting that connection comes at no cost. But if the pool is exhausted, then either the app needs to wait for the creation of a new connection, or it gets rejected. As with concurrent queries, high pool usage indicates a congestion of operations. However, pooled connections can be used across multiple queries in a transaction, as well as across multiple transactions, and the app may run its own logic in between queries while still holding on to the connection.

An exhausted pool is a strong indication of excessive load, while the difference between a 60% and an 80% used pool is not as clear an indication. Taking a step back, what does it mean that we exhaust some pool? Who decides the size of the pool in the first place? If someone picked a number such as 50 or 100, isn't that number just artificial?

It may well be, but the pool size was likely chosen for some good reason(s). It is perhaps derived from some database configuration, which is itself derived from some hardware limitation. And while that number could arguably be changed arbitrarily, it is still sensible, as far as throttling goes, to push back when the pool is exhausted. The throttler thereby relies on the greater system configuration and does not introduce any new artificial thresholds.
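For illustration, here's a pool-usage indicator using Go's database/sql pool statistics; the same idea applies to any pooling library that exposes similar counters. Note that the limit used here is the pool's own configured maximum, not a new artificial threshold:

    package metrics

    import "database/sql"

    // poolUsage returns the fraction of the configured pool currently in use,
    // or 0 when no maximum is configured.
    func poolUsage(db *sql.DB) float64 {
        stats := db.Stats()
        if stats.MaxOpenConnections <= 0 {
            return 0
        }
        return float64(stats.InUse) / float64(stats.MaxOpenConnections)
    }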

The case for multiple metrics

A throttler should be able to push back based on a combination of metrics, and not limit itself to just one metric. We've illustrated some metrics above, and every environment may yet have its own load predicting metrics. The administrator should be able to choose an assorted set of metrics the throttler should work with, be able to set specific thresholds for each such metric, and possibly be able to introduce new metrics either programmatically or dynamically.
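Conceptually, the check then becomes "are all configured metrics within their thresholds?". A minimal sketch, with metric names, values, and thresholds left to the caller as illustrative inputs:

    package throttle

    // metric pairs a sampled value with its configured threshold.
    type metric struct {
        value     float64
        threshold float64
    }

    // checkAll grants access only when every metric is within its threshold,
    // and otherwise reports which metric triggered the push-back.
    func checkAll(metrics map[string]metric) (ok bool, offending string) {
        for name, m := range metrics {
            if m.value > m.threshold {
                return false, name
            }
        }
        return true, ""
    }

A real implementation would also let the administrator register, remove, or reconfigure metrics at runtime, per the above.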

What does a throttled system look like?

Many software developers will be familiar with the following scenario: you have a multithreaded or otherwise highly concurrent app. There's a bug, likely a race condition or a synchronization issue, and you wish to find it. You choose to print informative debug messages to standard output, and hope to find the bug by examining the log. Alas, when you do so, the bug does not reproduce. Or it manifests elsewhere.

In adding writes to standard output, you have introduced new locks. Your debug messages now compete over those locks, which in turn incurs different context switches.

Introducing a throttler into your infrastructure bears resemblance to this synchronization example. All of a sudden there is less contention on the database, and certain apps that used to run just fine exhibit contention/latency behavior. The appearance of a new job suddenly affects the progress of another. But where previously you could clearly analyze database queries to find the root cause, the database now tells you little to nothing. It's now down to the throttler to give you that information. But even the throttler is limited, because all the apps do is check the throttler for health status; at that point they haven't actually done anything yet.

Let's say we throttle based on replication lag, and let's assume that we want to run an operation so massive that it is bound to drive replication lag high if let loose. With the throttler keeping it under control, though, the operation will only run small batches of subtasks. But an interesting behavior emerges: the operation will push replication lag up to the throttler's threshold, then back down, and push again. As we start the operation, we expect to see the replication lag graph jump up to the threshold value, and then more or less stabilize around that value, slightly higher and slightly lower, for the duration of the operation, which could be hours.

During that time, the operation will be granted access thousands of times, or more, and will likewise be rejected access thousands of times, or more. That is what a healthy system looks like with a throttler engaged. However many more concurrent operations we run, we expect to contain replication lag at about the same slight offset above or below the threshold. More on this when we discuss granularity.

It is not uncommon for a system to run one or two operations for very long periods, which means that what we consider the throttling threshold (say, a 5-second replication lag) becomes the actual standard. Thankfully, not all operations and workloads are so aggressive that they necessarily push the metrics as high as their thresholds.

Check intervals and metric granularity

A throttler collects its metrics asynchronously from check requests, so that it has an immediate answer available upon request. The intervals at which the throttler collects metrics can have a significant effect on how the throttler is put to use. Consider a case where the throttler collects a metric at a large interval, say every 5 seconds. The metric could have any value at all during those 5 seconds, but it is the specific sample taken at the end of that period that counts.
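The collection itself might look like the following sketch: a background loop samples the metric on a fixed interval and caches the last value, so a check request is answered immediately from that cache. The sampling function and the interval are placeholders for illustration:

    package throttle

    import (
        "sync/atomic"
        "time"
    )

    type collector struct {
        lastValue atomic.Value // holds float64
    }

    // start launches the sampling loop. sampleMetric is whatever measures the
    // underlying metric (replication lag, threads_running, and so on).
    func (c *collector) start(interval time.Duration, sampleMetric func() float64) {
        c.lastValue.Store(sampleMetric())
        go func() {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for range ticker.C {
                c.lastValue.Store(sampleMetric())
            }
        }()
    }

    // last returns the most recent sample; checks never wait on a live measurement.
    func (c *collector) last() float64 {
        return c.lastValue.Load().(float64)
    }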

The metrics themselves can also have limited granularity. Replication lag, for example, can be measured in different ways, the most common being deliberate injection of heartbeat events on the primary and capturing them on a replica. More on this in another post, but the intervals at which the heartbeat events are generated dictate the granularity, or accuracy, of the measured lag. Let's assume we inject heartbeats at one second intervals, and we've just injected a heartbeat at precisely noon. Let's also assume we sample the metrics once per second, and we happen to take that sample at 12:00:00.995. The sample still reads 12:00:00.000, as this was our last injected heartbeat. A client then checks the throttler at 12:00:01.990. By now there will have been a new metric value, but one which we have not sampled yet. The throttler responds using its last sample, which is almost, but not quite, one second old, and which in itself represents a metric that is now almost, but not quite, two seconds old.
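As a rough sketch of that heartbeat mechanism, consider a hypothetical single-row heartbeat table and reasonably synchronized clocks between primary and replica (both are assumptions for illustration; we'll look at measurement techniques more closely in a later post):

    package metrics

    import (
        "database/sql"
        "time"
    )

    // injectHeartbeat writes the current timestamp on the primary; replication
    // then ships this write to the replicas like any other change.
    func injectHeartbeat(primary *sql.DB) error {
        _, err := primary.Exec(
            "UPDATE heartbeat SET ts_nano = ? WHERE id = 1",
            time.Now().UnixNano(),
        )
        return err
    }

    // readLag estimates replication lag on a replica as "now minus the last
    // replicated heartbeat", accurate only to the injection interval.
    func readLag(replica *sql.DB) (time.Duration, error) {
        var tsNano int64
        err := replica.QueryRow("SELECT ts_nano FROM heartbeat WHERE id = 1").Scan(&tsNano)
        if err != nil {
            return 0, err
        }
        return time.Since(time.Unix(0, tsNano)), nil
    }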

Long heartbeat intervals and outdated information have a negative impact on both our system health and the throttler's utilization.

On the one hand, it is possible that during the interval we miss a significant uptick in system load. We'd only find out about it a few seconds later, at which time the throttler would be engaged. However, by that time, system performance has already degraded, and it will take a few seconds before it comes back down to acceptable values. On the other hand, once we do catch the metrics exceeding their thresholds, we reject all further requests for the duration of the next interval. If the metrics turn healthy sooner than that, that's a missed opportunity to make some progress. Thus, we degrade the database's capacity for running such operations.

When multiple operations attempt to make progress all at once, all will be throttled while metrics are above threshold, and possibly all released at once when metrics return to low values, thus all pushing the metrics up at once.

Borrowing from the world of networking hardware, it is recommended that the metric interval and granularity oversample the range of allowed thresholds. For example, if the acceptable replication lag is 5 seconds, then it's best to have a heartbeat/sampling interval of 1-2 seconds.

Lower intervals and more accurate metrics reduce spikes and spread the workload more efficiently. That, too, comes at a cost, which we will discuss in a later post.

To be continued

In the next part of this series we will be looking into singular vs. distributed throttler design, as well as the impact the throttler itself may have on your environment.