Your Tokio service handles a thousand requests a second in a benchmark and falls over at two hundred in production. The CPU sits at 30% while p99 latency spikes into seconds. This is almost never a throughput ceiling. It is a handful of async anti-patterns quietly stalling the runtime.
This runbook is for engineers running an existing async service on Tokio 1.x who want to find and fix the high-impact pitfalls: blocking calls on worker threads, locks held across .await, unbounded concurrency, oversized futures, and a mis-tuned runtime. You will instrument the runtime, apply each fix against your own code, and verify the latency drop before shipping.
Prerequisites
- A recent stable Rust toolchain (install via rustup)
- An existing service on the Tokio 1.x runtime (this runbook targets tokio 1.52)
- A reproducible load test you can run before and after (wrk, oha, or k6)
- A staging environment, since runtime tuning changes process-level behavior
- Permission to redeploy the service and restart it under load
The execution model, in one minute
Every fix below follows from one fact: a Future is inert. It does nothing until an executor calls poll. Each poll returns either Poll::Ready(value) or Poll::Pending. When a future returns Poll::Pending, it stores a Waker so the runtime knows to poll it again later.
A Tokio worker thread runs a loop: pick a ready task, poll it, move on. That loop only makes progress when each poll returns quickly. If a poll runs a synchronous database call, hashes a 4 MB payload, or spins a tight CPU loop, the worker is stuck inside that single poll and cannot drive any other task. With the default of one worker per core, a few stuck polls starve everything. Every step here keeps individual polls short and bounds how many run at once.
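The poll contract described above can be seen without any runtime at all. The following is a minimal sketch using only the standard library; `Countdown`, `CountingWaker`, and `drive` are hypothetical names made up for illustration, and the busy loop in `drive` is a stand-in for a real executor, which would park until the waker fires instead of spinning.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical future: reports Pending `remaining` times, then Ready.
struct Countdown {
    remaining: u32,
}

impl Future for Countdown {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.remaining == 0 {
            return Poll::Ready("done");
        }
        self.remaining -= 1;
        // Ask to be polled again. A real future would hand the waker
        // to an IO source or timer instead of waking itself.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Counts wake-ups so the effect of the Waker is observable.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Polls `fut` to completion and returns (output, polls, wakes).
/// Nothing inside `fut` runs until the first call to `poll` here.
fn drive(fut: Countdown) -> (&'static str, u32, usize) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls, counter.wakes.load(Ordering::SeqCst));
        }
    }
}
```

Calling `drive(Countdown { remaining: 3 })` polls four times: three polls return Pending and register a wake-up, and the fourth returns Ready. Any synchronous work placed inside `poll` would hold the calling thread for its full duration, which is exactly the stall the steps below remove.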

Step 1. Instrument the runtime with tokio-console
Find the stalled tasks before you change anything, using tokio-console to watch per-task poll times live.
Add the subscriber dependency and enable Tokio's tracing feature:
[dependencies]
tokio = { version = "1.52", features = ["full", "tracing"] }
console-subscriber = "0.5"
Initialize it as the first line of your async entry point:
#[tokio::main]
async fn main() {
    console_subscriber::init();
    // ... the rest of your service setup
}
The subscriber requires the tokio_unstable cfg. Build and run with it set, install the console, then attach in a second terminal:
# terminal 1: run your service with the unstable cfg
RUSTFLAGS="--cfg tokio_unstable" cargo run
# terminal 2: install once, then attach
cargo install --locked tokio-console
tokio-console
Expected result: the console attaches to 127.0.0.1:6669 by default and lists every task with its busy time and poll count. Tasks that hold a worker too long get flagged with a warning. Those are your targets.
Recovery: if nothing connects, the cfg was not applied at build time. Confirm RUSTFLAGS is exported for the build, or set it in .cargo/config.toml, and rebuild.
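If you go the config-file route, a minimal fragment looks like the one below. It assumes the file lives at .cargo/config.toml in the workspace root and that nothing else in your setup already sets rustflags. Cargo uses the RUSTFLAGS environment variable in preference to this key when both are present, so unset the variable if you want the file to be the single source of truth.

```toml
# .cargo/config.toml
[build]
rustflags = ["--cfg", "tokio_unstable"]
```

Changing the flag invalidates the build cache, so expect a full rebuild the first time.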
Step 2. Move blocking and CPU work off the worker threads
Get synchronous work out of the poll loop. The Tokio docs are blunt about why: issuing a blocking call or performing a lot of compute in a future without yielding is problematic, as it may prevent the executor from driving other futures forward.
A CPU-bound or blocking call sitting directly in an async function pins the worker:
// Before: a synchronous, CPU-bound call on a runtime worker thread
async fn handle(req: Request) -> Result<Digest> {
    let digest = expensive_hash(&req.body); // blocks this worker for the whole hash
    Ok(digest)
}
Move it to the blocking pool with spawn_blocking, which runs the closure on a thread dedicated to blocking operations:
use tokio::task;
// After: the hash runs on a dedicated blocking thread, the worker stays free
async fn handle(req: Request) -> Result<Digest> {
    let body = req.body;
    let digest = task::spawn_blocking(move || expensive_hash(&body)).await?;
    Ok(digest)
}
Use spawn_blocking for short-lived blocking work, and a dedicated long-lived thread for a persistent worker. On the multi-threaded runtime you can also use task::block_in_place, which runs the closure on the current thread after handing other tasks off. It panics on a current_thread runtime, so reach for it only on the multi-thread scheduler.
Expected result: the offending task's poll duration in the console drops to microseconds. Under load you will see blocking-thread count climb, capped at the runtime's max_blocking_threads (512 by default).
Recovery: if a spawn_blocking closure needs to be cancellable, note that these tasks cannot be aborted because they are not async. Bound it with a timeout on the awaiting side instead, or revert to the inline call while you redesign.
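For the persistent-worker case mentioned in this step, one possible shape is a single long-lived thread fed through a channel. This is a sketch under assumptions, using only the standard library: `HashJob` and `start_hash_worker` are hypothetical names, and `expensive_hash` here is a trivial FNV-1a stand-in for whatever synchronous function your service actually calls.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical job: a payload plus a channel for sending the result back.
struct HashJob {
    payload: Vec<u8>,
    reply: mpsc::Sender<u64>,
}

/// Stand-in for an expensive synchronous function (a simple FNV-1a hash).
fn expensive_hash(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Starts one long-lived worker thread. Returns the job sender and a join
/// handle that yields the number of jobs served once the worker exits.
fn start_hash_worker() -> (mpsc::Sender<HashJob>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<HashJob>();
    let handle = thread::spawn(move || {
        let mut served = 0;
        // The loop ends when every Sender has been dropped.
        for job in rx {
            let digest = expensive_hash(&job.payload);
            // Ignore the error if the requester gave up waiting.
            let _ = job.reply.send(digest);
            served += 1;
        }
        served
    });
    (tx, handle)
}
```

Inside an async handler you would likely swap the reply channel for tokio::sync::oneshot so the handler can await the result instead of blocking on recv. The standard-library channel is used here only to keep the sketch dependency-free.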
Step 3. Stop holding locks across .await
A lock guard held while you .await serializes every task that wants that lock, turning concurrent work into a queue. The usual cause is reaching for the async mutex by default.
The tokio::sync::Mutex docs are explicit: Contrary to popular belief, it is ok and often preferred to use the ordinary Mutex from the standard library in asynchronous code. The async mutex exists for one reason, holding a lock across an await, and that makes it more expensive than the blocking mutex.
Scope a standard mutex so the guard drops before any await:
use std::sync::{Arc, Mutex};
// Good: the guard is dropped before we touch the database
let snapshot = {
    let mut guard = data.lock().unwrap();
    guard.counter += 1;
    guard.snapshot()
}; // guard dropped here
write_to_db(&snapshot).await; // no lock held across this await
Keep tokio::sync::Mutex only for its intended job: shared mutable access to an IO resource, such as a single database connection, where you genuinely must hold the lock across the await.
Expected result: contended lock waits disappear from the console, and tasks that previously serialized now overlap. Throughput rises without any change to the work itself.
Recovery: if switching to std::sync::Mutex surfaces a compile error about the future not being Send because a MutexGuard is held across an await, the compiler is catching a real held-across-await bug. Fix the scope rather than reverting to the async mutex.
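One way to make the scoping rule hard to break is to hide the mutex behind a small wrapper whose methods are not async. The sketch below assumes hypothetical `Stats`, `Snapshot`, and `SharedStats` types; the field names are illustrative and not taken from any real service.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical shared state; the field is illustrative.
#[derive(Default)]
struct Stats {
    counter: u64,
}

/// Cheap, owned copy of the data needed after the lock is released.
#[derive(Debug, Clone, PartialEq)]
struct Snapshot {
    counter: u64,
}

/// Cloneable handle that owns the mutex and exposes only sync methods.
#[derive(Clone, Default)]
struct SharedStats {
    inner: Arc<Mutex<Stats>>,
}

impl SharedStats {
    /// Not an `async fn`, so the guard cannot live across an `.await`.
    fn bump_and_snapshot(&self) -> Snapshot {
        let mut guard = self.inner.lock().unwrap();
        guard.counter += 1;
        Snapshot { counter: guard.counter }
    } // guard dropped here
}
```

A handler then calls `stats.bump_and_snapshot()` and awaits the database write with the returned `Snapshot`. Because the lock is taken and released inside a synchronous method, there is no place for an await to slip in while it is held.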
Step 4. Bound unbounded concurrency
Firing every task at once looks fast and then exhausts sockets, file handles, or the connection pool. A collection-wide join_all over ten thousand URLs opens ten thousand sockets:
use futures::future::join_all;
// Before: every request starts at once
let results = join_all(urls.into_iter().map(fetch)).await;
Cap the in-flight count with buffer_unordered from the futures crate. The docs guarantee that no more than n futures will be buffered at any point in time:
use futures::stream::{self, StreamExt};
// After: at most 16 requests run concurrently
let results: Vec<_> = stream::iter(urls)
    .map(fetch)           // each url -> impl Future
    .buffer_unordered(16) // hard ceiling on in-flight futures
    .collect()
    .await;
Results arrive in completion order. If you need them in the original input order, swap in buffered(n), which applies the same ceiling while preserving order.
Expected result: peak open connections and memory flatten to a predictable ceiling tied to n, and the downstream service stops returning connection-limit errors under burst.
Recovery: if throughput drops too far, raise n in steps and re-measure. The right value is the largest n that keeps the downstream dependency healthy.
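To see why the ceiling holds, it can help to look at a hand-rolled version of the idea. The sketch below is not how the futures crate implements buffer_unordered; it is a standard-library-only illustration in which `FakeFetch` is a hypothetical stand-in for `fetch`, and `run_bounded` busy-polls rather than waiting for wake-ups as a real executor would.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for a busy-polling demo loop.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for `fetch`: Pending for `ticks` polls, then yields `id`.
struct FakeFetch {
    id: u32,
    ticks: u32,
}

impl Future for FakeFetch {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        if self.ticks == 0 {
            return Poll::Ready(self.id);
        }
        self.ticks -= 1;
        Poll::Pending
    }
}

/// Runs all futures with at most `limit` in flight.
/// Returns the results in completion order and the peak in-flight count.
fn run_bounded<F: Future + Unpin>(work: Vec<F>, limit: usize) -> (Vec<F::Output>, usize) {
    assert!(limit > 0, "limit must be at least 1");
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut queued: VecDeque<F> = work.into();
    let mut in_flight: Vec<F> = Vec::new();
    let mut results = Vec::new();
    let mut peak = 0usize;

    while !queued.is_empty() || !in_flight.is_empty() {
        // Top up the in-flight set, never past the ceiling.
        while in_flight.len() < limit {
            match queued.pop_front() {
                Some(fut) => in_flight.push(fut),
                None => break,
            }
        }
        peak = peak.max(in_flight.len());

        // Poll every in-flight future once; keep the ones still pending.
        let mut still_pending = Vec::new();
        for mut fut in in_flight.drain(..) {
            match Pin::new(&mut fut).poll(&mut cx) {
                Poll::Ready(out) => results.push(out),
                Poll::Pending => still_pending.push(fut),
            }
        }
        in_flight = still_pending;
    }
    (results, peak)
}
```

Queued work is not started until a slot frees up, so the peak never exceeds `limit` no matter how long the input is. That is the property that keeps sockets and pool connections bounded in the real stream combinator.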
Step 5. Tame long CPU loops and oversized futures
Some work is async overall but contains a stretch that never yields. Tokio gives each task an operation budget; once it is spent, the task's resources report not-ready until it yields, which is how the runtime keeps one task from starving the rest. A long synchronous loop never hits an await, so it never cooperates. Insert explicit yield points:
use tokio::task;
for (i, item) in items.iter().enumerate() {
    process(item); // cheap individually, but the loop is long
    // yield after every 1024th item, not on the very first one
    if (i + 1) % 1024 == 0 {
        task::yield_now().await; // hand the worker back to the scheduler
    }
}
If the loop is genuinely CPU-bound rather than occasionally long, prefer spawn_blocking from Step 2. Note that yield_now only suggests a yield: the runtime may immediately re-poll the same task, so treat it as a relief valve, not a scheduling guarantee.
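The second problem is future size. Every local held across an await is stored inline in the future, and each nested future is embedded in its parent, so deep async call chains compile into one large state machine. That value is copied whenever the future is moved (for example when it is passed to tokio::spawn) and can overflow the stack. A directly recursive async fn will not compile at all without indirection, because its size would be infinite. Catch the offenders with clippy: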
cargo clippy -- -W clippy::large_futures
# trips when a future exceeds future-size-threshold (default 16384 bytes)
Box the heavy or recursive futures so only a pointer moves:
use futures::future::BoxFuture;
fn recurse(n: u32) -> BoxFuture<'static, u32> {
    Box::pin(async move {
        if n == 0 { return 0; }
        recurse(n - 1).await + 1
    })
}
Expected result: long-poll warnings clear in the console, and the clippy lint passes once the boxed futures fall under the threshold.
Recovery: over-yielding adds overhead. If a yield_now inside a hot inner loop dropped throughput, widen the interval or move the whole loop to spawn_blocking.
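If you want to confirm a suspect's size directly rather than wait for the lint, std::mem::size_of_val works on futures. The sketch below assumes a hypothetical `heavy` handler that keeps a 16 KiB buffer alive across an await; exact byte counts depend on the compiler version, so only the order of magnitude is meaningful.

```rust
use std::mem::size_of_val;

/// Trivial await point so the buffer below has to live across an `.await`.
async fn checkpoint() {}

/// Hypothetical handler that keeps a 16 KiB buffer alive across an await,
/// which forces the buffer into the future's state machine.
async fn heavy(seed: u8) -> usize {
    let mut buf = [0u8; 16 * 1024];
    buf[0] = seed;
    checkpoint().await;
    buf.iter().map(|b| usize::from(*b)).sum()
}

/// Size in bytes of the unboxed future returned by `heavy`.
/// Creating the future runs none of its body; it is only measured.
fn inline_size() -> usize {
    size_of_val(&heavy(1))
}

/// Size in bytes of the handle once the same future is boxed and pinned.
fn boxed_size() -> usize {
    let boxed = Box::pin(heavy(1));
    size_of_val(&boxed)
}
```

Here `inline_size()` comes out at 16 KiB or more, while `boxed_size()` is a single pointer. The buffer still exists after boxing, but it lives on the heap and stays put while only the pointer is moved around.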
Step 6. Right-size the runtime
Tune the runtime last, once the code stops abusing it. Build it explicitly so the knobs are visible and version-controlled:
use std::time::Duration;
use tokio::runtime::Builder;
let runtime = Builder::new_multi_thread()
    .worker_threads(8)                          // default: number of CPU cores
    .max_blocking_threads(256)                  // default: 512
    .thread_keep_alive(Duration::from_secs(10)) // default: 10s idle timeout
    .enable_all()
    .build()
    .unwrap();
runtime.block_on(async {
    // run your service
});
Defaults are sensible: one worker per core, up to 512 blocking threads. Change them with evidence, not by reflex. Adding worker threads beyond core count rarely helps IO-bound work and adds context-switching; an IO-bound service that leans on spawn_blocking benefits more from tuning the blocking pool. worker_threads panics if set to zero.
Expected result: with the right counts, the runtime's busy ratio under load sits near your target without workers parking idle or thrashing. Measure it in the next section before keeping any change.
Recovery: runtime config is process-level. If a tuning change regresses latency, revert the builder values and restart. The old behavior returns on the next boot with no residual state.
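Because worker_threads panics on zero, it is worth validating any externally supplied count before it reaches the builder. The helper below is one possible sketch using only the standard library; `resolve_worker_threads`, `parse_override`, and the WORKER_THREADS variable name are assumptions made for illustration, not part of Tokio.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Picks a worker-thread count: an explicit override if given, otherwise
/// the parallelism the OS reports. Never returns zero.
fn resolve_worker_threads(requested: Option<usize>) -> usize {
    let detected = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1); // fall back to one worker if detection fails
    requested.unwrap_or(detected).max(1)
}

/// Parses an override such as the value of a hypothetical WORKER_THREADS
/// environment variable; anything that is not a number is ignored.
fn parse_override(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
}
```

You could feed the result into the builder with `.worker_threads(resolve_worker_threads(parse_override(std::env::var("WORKER_THREADS").ok().as_deref())))`. That keeps the value adjustable per environment while a bad or zero setting degrades to a safe count instead of a panic at startup.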
Verify the gains
Prove the fixes worked with three independent signals, not a gut feeling.
First, re-attach tokio-console under the same load. The tasks you targeted should now show short poll times and no long-poll warnings.
Second, watch runtime health numerically with tokio-metrics. Poll the runtime monitor and log the busy ratio:
use std::time::Duration;
let handle = tokio::runtime::Handle::current();
let monitor = tokio_metrics::RuntimeMonitor::new(&handle);
tokio::spawn(async move {
    for interval in monitor.intervals() {
        // busy_ratio() is a method on the sampled metrics, not a field
        println!("busy_ratio={:.3}", interval.busy_ratio());
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
});
Track busy_ratio, total_busy_duration, and total_park_count across the run. A healthy service shows workers busy when work exists and parked when it does not, rather than one worker pinned at 100%.
Third, run the same load test from your prerequisites and compare p99 latency and throughput at identical concurrency. Hold the test fixed; the only variable should be your code. A real fix moves the tail, not just the average.
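If your load tool exports raw latencies, the tail and the average can be compared with a few lines of code. This is a minimal sketch that assumes samples are in milliseconds and uses the nearest-rank definition of a percentile; `percentile` and `mean` are hypothetical helpers, and your load tool may interpolate differently.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `pct` is expected in (0, 100]; returns None for an empty sample set.
fn percentile(samples: &[u64], pct: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: ceil(pct / 100 * n), clamped into 1..=n.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    let index = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[index])
}

/// Arithmetic mean of the samples, or None if there are none.
fn mean(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<u64>() as f64 / samples.len() as f64)
}
```

With 98 requests at 10 ms and 2 at 2000 ms, the mean is 49.8 ms while the p99 is 2000 ms. The average hides the stall almost entirely, which is why the before-and-after comparison should be made on the tail.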
Rollback
Every change here is code or runtime configuration. None of it touches data, so rollback is clean and carries no data-loss risk.
# revert the optimization commit and redeploy the previous binary
git revert <commit-sha>
cargo build --release
# restart the service / redeploy the previous artifact
Runtime tuning from Step 6 reverts the moment the old binary restarts. Before shipping to production, strip the console-subscriber init and the tokio_unstable cfg unless you want to keep the diagnostics, since the subscriber adds tracing overhead. The one thing rollback does not undo is a downstream dependency that was already overloaded before you added the concurrency ceiling in Step 4; reverting that step can re-open the floodgates, so roll it back last and watch the dependency.
What's next
With the runtime healthy, add span-level tracing to attribute latency to specific code paths, and wire criterion microbenchmarks around the hot functions so regressions surface in CI rather than in production. The habit that matters most is the one from the Verify section: keep the load test fixed and re-run it on every change, so each optimization is judged on the tail latency it actually moves.
