Optimize Rust async/await performance: a Tokio runbook

Instrument, fix, and verify the high-impact async pitfalls in a production Tokio service.

By Abhi Panseriya, Fullstack Engineer at Carousell

Published 26 Jun 2026 · 10 min read

Your Tokio service handles a thousand requests a second in a benchmark and falls over at two hundred in production. The CPU sits at 30% while p99 latency spikes into seconds. This is almost never a throughput ceiling. It is a handful of async anti-patterns quietly stalling the runtime.

This runbook is for engineers running an existing async service on Tokio 1.x who want to find and fix the high-impact pitfalls: blocking calls on worker threads, locks held across .await, unbounded concurrency, oversized futures, and a mis-tuned runtime. You will instrument the runtime, apply each fix against your own code, and verify the latency drop before shipping.

Prerequisites

A recent stable Rust toolchain (install via rustup)
An existing service on the Tokio 1.x runtime (this runbook targets tokio 1.52)
A reproducible load test you can run before and after (wrk, oha, or k6)
A staging environment, since runtime tuning changes process-level behavior
Permission to redeploy the service and restart it under load

The execution model, in one minute

Every fix below follows from one fact: a Future is inert. It does nothing until an executor calls poll. Each poll returns either Poll::Ready(value) or Poll::Pending. When a future returns Poll::Pending, it stores a Waker so the runtime knows to poll it again later.
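To make the poll contract concrete, here is a minimal sketch that drives a hand-written future with no runtime at all. `ReadyOnSecondPoll`, `CountingWaker`, and `drive` are hypothetical names invented for this illustration; only the `std::future` and `std::task` items are real APIs.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical future: pending on the first poll, ready on the second.
struct ReadyOnSecondPoll {
    polls: u32,
}

impl Future for ReadyOnSecondPoll {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        self.polls += 1;
        if self.polls < 2 {
            // Ask to be polled again, then report "not done yet".
            cx.waker().wake_by_ref();
            Poll::Pending
        } else {
            Poll::Ready(self.polls)
        }
    }
}

/// Waker that only counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Polls the future by hand; returns (pending polls, final value, wake count).
fn drive() -> (u32, u32, usize) {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);

    // Nothing has happened yet: creating the future ran none of its code.
    let mut fut = Box::pin(ReadyOnSecondPoll { polls: 0 });
    let mut pending = 0;
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Pending => pending += 1,
            Poll::Ready(value) => return (pending, value, counter.0.load(Ordering::SeqCst)),
        }
    }
}

fn main() {
    let (pending, value, wakes) = drive();
    println!("pending={pending} value={value} wakes={wakes}");
}
```

A real executor would wait for the wake-up instead of looping, but the shape is the same: the future only advances inside `poll`, and whatever `poll` does, the calling thread does nothing else until it returns.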

A Tokio worker thread runs a loop: pick a ready task, poll it, move on. That loop only makes progress when each poll returns quickly. If a poll runs a synchronous database call, hashes a 4 MB payload, or spins a tight CPU loop, the worker is stuck inside that single poll and cannot drive any other task. With the default of one worker per core, a few stuck polls starve everything. Every step here keeps individual polls short and bounds how many run at once.

Step 1. Instrument the runtime with tokio-console

Find the stalled tasks before you change anything, using tokio-console to watch per-task poll times live.

Add the subscriber dependency and enable Tokio's tracing feature:

[dependencies]
tokio = { version = "1.52", features = ["full", "tracing"] }
console-subscriber = "0.5"

Initialize it as the first line of your async entry point:

#[tokio::main]
async fn main() {
    console_subscriber::init();
    // ... the rest of your service setup
}

The subscriber requires the tokio_unstable cfg. Build and run with it set, install the console, then attach in a second terminal:

# terminal 1: run your service with the unstable cfg
RUSTFLAGS="--cfg tokio_unstable" cargo run

# terminal 2: install once, then attach
cargo install --locked tokio-console
tokio-console

Expected result: the console attaches to 127.0.0.1:6669 by default and lists every task with its busy time and poll count. Tasks that hold a worker too long get flagged with a warning. Those are your targets.

Recovery: if nothing connects, the cfg was not applied at build time. Confirm RUSTFLAGS is exported for the build, or set it in .cargo/config.toml, and rebuild.
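For the .cargo/config.toml route, a minimal fragment looks like the following. It assumes the file sits at the project (or workspace) root and that no other rustflags are already configured there.

```toml
# .cargo/config.toml
[build]
rustflags = ["--cfg", "tokio_unstable"]
```

Cargo uses the first source of flags it finds, and the RUSTFLAGS environment variable ranks ahead of build.rustflags, so an exported RUSTFLAGS without the cfg will silently override this file.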

Step 2. Move blocking and CPU work off the worker threads

Get synchronous work out of the poll loop. The Tokio docs are blunt about why: "issuing a blocking call or performing a lot of compute in a future without yielding is problematic, as it may prevent the executor from driving other futures forward."

A CPU-bound or blocking call sitting directly in an async function pins the worker:

// Before: a synchronous, CPU-bound call on a runtime worker thread
async fn handle(req: Request) -> Result<Digest> {
    let digest = expensive_hash(&req.body); // blocks this worker for the whole hash
    Ok(digest)
}

Move it to the blocking pool with spawn_blocking, which runs the closure on a thread dedicated to blocking operations:

use tokio::task;

// After: the hash runs on a dedicated blocking thread, the worker stays free
async fn handle(req: Request) -> Result<Digest> {
    let body = req.body;
    let digest = task::spawn_blocking(move || expensive_hash(&body)).await?;
    Ok(digest)
}

Use spawn_blocking for short-lived blocking work, and a dedicated long-lived thread for a persistent worker. On the multi-threaded runtime you can also use task::block_in_place, which runs the closure on the current thread after handing other tasks off. It panics on a current_thread runtime, so reach for it only on the multi-thread scheduler.

Expected result: the offending task's poll duration in the console drops to microseconds. Under load you will see blocking-thread count climb, capped at the runtime's max_blocking_threads (512 by default).

Recovery: spawn_blocking tasks are not async, so once the closure has started running it cannot be aborted. If you need to stop waiting for one, bound it with a timeout on the awaiting side, keeping in mind that the closure still runs to completion on its blocking thread. Otherwise, revert to the inline call while you redesign.

Step 3. Stop holding locks across .await

A lock guard held while you .await serializes every task that wants that lock, turning concurrent work into a queue. The usual cause is reaching for the async mutex by default.

The tokio::sync::Mutex docs are explicit: "Contrary to popular belief, it is ok and often preferred to use the ordinary Mutex from the standard library in asynchronous code." The async mutex exists for one reason, holding a lock across an await, and that makes it more expensive than the blocking mutex.

Scope a standard mutex so the guard drops before any await:

use std::sync::{Arc, Mutex};

// Good: the guard is dropped before we touch the database
let snapshot = {
    let mut guard = data.lock().unwrap();
    guard.counter += 1;
    guard.snapshot()
}; // guard dropped here

write_to_db(&snapshot).await; // no lock held across this await

Keep tokio::sync::Mutex only for its intended job: shared mutable access to an IO resource, such as a single database connection, where you genuinely must hold the lock across the await.

Expected result: contended lock waits disappear from the console, and tasks that previously serialized now overlap. Throughput rises without any change to the work itself.

Recovery: if switching to std::sync::Mutex surfaces a compile error saying the future is not Send because a MutexGuard is held across an await, the compiler is catching a real held-across-await bug. Fix the scope rather than reverting to the async mutex.
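The scoping rule can be checked at compile time. The sketch below uses only the standard library: `Stats`, `record`, `require_send`, and the tiny `block_on` are hypothetical helpers for this illustration, and `write_to_db` is a stand-in that does no real IO. `require_send` applies the same Send bound that tokio::spawn places on a future.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

struct Stats {
    counter: u64,
}

/// Hypothetical stand-in for an async database write.
async fn write_to_db(_snapshot: u64) {
    std::future::ready(()).await;
}

/// Mutate under the lock, copy out what is needed, release, then await.
async fn record(data: Arc<Mutex<Stats>>) -> u64 {
    let snapshot = {
        let mut guard = data.lock().unwrap();
        guard.counter += 1;
        guard.counter
    }; // guard dropped here, before any .await

    write_to_db(snapshot).await;
    snapshot
}

/// Compiles only for `Send` futures, the bound a multi-thread spawn needs.
fn require_send<F: Future + Send>(fut: F) -> F {
    fut
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor so the sketch runs without a runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let data = Arc::new(Mutex::new(Stats { counter: 0 }));
    let first = block_on(require_send(record(data.clone())));
    let second = block_on(require_send(record(data.clone())));
    println!("first={first} second={second}");
}
```

Moving the `write_to_db(...).await` inside the braces would keep the guard alive across the await, and the `require_send` call would stop compiling. That failure is the signal to narrow the scope.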

Step 4. Bound unbounded concurrency

Firing every task at once looks fast and then exhausts sockets, file handles, or the connection pool. A collection-wide join_all over ten thousand URLs opens ten thousand sockets:

use futures::future::join_all;

// Before: every request starts at once
let results = join_all(urls.into_iter().map(fetch)).await;

Cap the in-flight count with buffer_unordered from the futures crate. The docs guarantee that no more than n futures will be buffered at any point in time:

use futures::stream::{self, StreamExt};

// After: at most 16 requests run concurrently
let results: Vec<_> = stream::iter(urls)
    .map(fetch)            // each url -> impl Future
    .buffer_unordered(16)  // hard ceiling on in-flight futures
    .collect()
    .await;

Results arrive in completion order. If you need them in the original input order, swap in buffered(n), which applies the same ceiling while preserving order.

Expected result: peak open connections and memory flatten to a predictable ceiling tied to n, and the downstream service stops returning connection-limit errors under burst.

Recovery: if throughput drops too far, raise n in steps and re-measure. The right value is the largest n that keeps the downstream dependency healthy.

Step 5. Tame long CPU loops and oversized futures

Some work is async overall but contains a stretch that never yields. Tokio gives each task an operation budget; once it is spent, the task's resources report not-ready until it yields, which is how the runtime keeps one task from starving the rest. A long synchronous loop never hits an await, so it never cooperates. Insert explicit yield points:

use tokio::task;

for (i, item) in items.iter().enumerate() {
    process(item); // cheap individually, but the loop is long
    if (i + 1) % 1024 == 0 {
        // after every 1024 items, hand the worker back to the scheduler
        task::yield_now().await;
    }
}

If the loop is genuinely CPU-bound rather than occasionally long, prefer spawn_blocking from Step 2. Note that yield_now only suggests a yield: the runtime may immediately re-poll the same task, so treat it as a relief valve, not a scheduling guarantee.

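The second problem is future size. A future stores the state of every future it awaits inline, so a deep async call chain compiles into one large state machine. That whole value is copied each time the future is moved (passed to spawn, returned from a function, or embedded in another future), and a large enough one can overflow the stack. A directly recursive async fn goes further and will not compile without indirection, because its size would be infinite. Catch the offenders with clippy: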

cargo clippy -- -W clippy::large_futures
# trips when a future exceeds future-size-threshold (default 16384 bytes)

Box the heavy or recursive futures so only a pointer moves:

use futures::future::BoxFuture;

fn recurse(n: u32) -> BoxFuture<'static, u32> {
    Box::pin(async move {
        if n == 0 { return 0; }
        recurse(n - 1).await + 1
    })
}
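The inline-state effect is easy to measure with std::mem::size_of_val. This is a rough sketch using only the standard library; `holds_buffer`, `inline_caller`, and `boxed_caller` are hypothetical functions for this illustration, and the exact byte counts depend on the compiler version, so the code prints them rather than assuming values.

```rust
use std::mem::size_of_val;

/// Keeps an 8 KiB buffer alive across an await, so the buffer lives in the future.
async fn holds_buffer() -> u8 {
    let buf = [0u8; 8192];
    std::future::ready(()).await;
    buf[0]
}

/// Awaits the large future inline: its state is embedded in this future too.
async fn inline_caller() -> u8 {
    holds_buffer().await
}

/// Awaits it through a Box: this future's state holds a pointer, not the buffer.
async fn boxed_caller() -> u8 {
    Box::pin(holds_buffer()).await
}

/// Sizes in bytes of (leaf future, inline caller, boxed caller).
fn sizes() -> (usize, usize, usize) {
    (
        size_of_val(&holds_buffer()),
        size_of_val(&inline_caller()),
        size_of_val(&boxed_caller()),
    )
}

fn main() {
    let (leaf, inline, boxed) = sizes();
    println!("leaf={leaf}B inline={inline}B boxed={boxed}B");
}
```

Boxing trades one heap allocation per call for a small parent future, which is usually the right deal for rarely taken or recursive branches and the wrong one inside a hot loop.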

Expected result: long-poll warnings clear in the console, and the clippy lint passes once the boxed futures fall under the threshold.

Recovery: over-yielding adds overhead. If a yield_now inside a hot inner loop dropped throughput, widen the interval or move the whole loop to spawn_blocking.

Step 6. Right-size the runtime

Tune the runtime last, once the code stops abusing it. Build it explicitly so the knobs are visible and version-controlled:

use std::time::Duration;
use tokio::runtime::Builder;

let runtime = Builder::new_multi_thread()
    .worker_threads(8)                          // default: number of CPU cores
    .max_blocking_threads(256)                  // default: 512
    .thread_keep_alive(Duration::from_secs(10)) // default: 10s idle timeout
    .enable_all()
    .build()
    .unwrap();

runtime.block_on(async {
    // run your service
});

Defaults are sensible: one worker per core, up to 512 blocking threads. Change them with evidence, not by reflex. Adding worker threads beyond core count rarely helps IO-bound work and adds context-switching; an IO-bound service that leans on spawn_blocking benefits more from tuning the blocking pool. worker_threads panics if set to zero.
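If the counts come from deployment configuration rather than literals, a small guard avoids the zero-worker panic. The sketch below is one way to do it with the standard library only. `worker_threads_from` and the APP_WORKER_THREADS variable are hypothetical names for this example, not part of Tokio.

```rust
use std::num::NonZeroUsize;

/// Parses a worker-thread override, falling back to `default` when the value
/// is missing, malformed, or zero (zero would make `worker_threads` panic).
fn worker_threads_from(raw: Option<&str>, default: NonZeroUsize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default.get())
}

fn main() {
    // APP_WORKER_THREADS is a hypothetical variable name for this sketch.
    let raw = std::env::var("APP_WORKER_THREADS").ok();
    // Fall back to one worker if the core count cannot be determined.
    let default = std::thread::available_parallelism()
        .unwrap_or(NonZeroUsize::new(1).unwrap());
    let workers = worker_threads_from(raw.as_deref(), default);
    println!("worker_threads={workers}");
    // Pass `workers` to Builder::worker_threads(..) when building the runtime.
}
```

Logging the resolved value at startup keeps the knob visible even when it is set outside the repository.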

Expected result: with the right counts, the runtime's busy ratio under load sits near your target without workers parking idle or thrashing. Measure it in the next section before keeping any change.

Recovery: runtime config is process-level. If a tuning change regresses latency, revert the builder values and restart. The old behavior returns on the next boot with no residual state.

Verify the gains

Prove the fixes worked with three independent signals, not a gut feeling.

First, re-attach tokio-console under the same load. The tasks you targeted should now show short poll times and no long-poll warnings.

Second, watch runtime health numerically with tokio-metrics. Poll the runtime monitor and log the busy ratio:

use std::time::Duration;

let handle = tokio::runtime::Handle::current();
let monitor = tokio_metrics::RuntimeMonitor::new(&handle);

tokio::spawn(async move {
    for interval in monitor.intervals() {
        // busy_ratio is a derived metric, exposed as a method rather than a field
        println!("busy_ratio={:.3}", interval.busy_ratio());
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
});

Track busy_ratio, total_busy_duration, and total_park_count across the run. A healthy service shows workers busy when work exists and parked when it does not, rather than one worker pinned at 100%.

Third, run the same load test from your prerequisites and compare p99 latency and throughput at identical concurrency. Hold the test fixed; the only variable should be your code. A real fix moves the tail, not just the average.
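Load-test tools report percentiles for you, but it helps to see why the tail and the average can disagree. The sketch below computes a nearest-rank percentile over made-up latency samples; `percentile`, `mean`, and the sample values are illustrative assumptions, not measurements from a real service.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns `None` for an empty slice or a percentile outside 1..=100.
fn percentile(samples: &[u64], pct: usize) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Smallest rank such that at least pct% of samples are at or below it.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

fn mean(samples: &[u64]) -> f64 {
    samples.iter().sum::<u64>() as f64 / samples.len() as f64
}

fn main() {
    // Made-up samples: 98 fast requests plus 2 stalls, before and after a fix.
    let mut before = vec![10u64; 98];
    before.extend([2000, 2000]);
    let mut after = vec![12u64; 98];
    after.extend([40, 40]);

    println!("before: mean={:.1} p50={:?} p99={:?}",
        mean(&before), percentile(&before, 50), percentile(&before, 99));
    println!("after:  mean={:.1} p50={:?} p99={:?}",
        mean(&after), percentile(&after, 50), percentile(&after, 99));
}
```

In this made-up data the median barely changes between runs while the p99 moves by a large factor, which is the pattern a stalled-worker fix tends to produce and the reason to compare tails rather than averages.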

Rollback

Every change here is code or runtime configuration. None of it touches data, so rollback is clean and carries no data-loss risk.

# revert the optimization commit and redeploy the previous binary
git revert <commit-sha>
cargo build --release
# restart the service / redeploy the previous artifact

Runtime tuning from Step 6 reverts the moment the old binary restarts. Before shipping to production, strip the console-subscriber init and the tokio_unstable cfg unless you want to keep the diagnostics, since the subscriber adds tracing overhead. The one thing rollback does not undo is a downstream dependency that was already overloaded before you added the concurrency ceiling in Step 4; reverting that step can re-open the floodgates, so roll it back last and watch the dependency.

What's next

With the runtime healthy, add span-level tracing to attribute latency to specific code paths, and wire criterion microbenchmarks around the hot functions so regressions surface in CI rather than in production. The habit that matters most is the one from the Verify section: keep the load test fixed and re-run it on every change, so each optimization is judged on the tail latency it actually moves.

Questions & answers

Should I use tokio::sync::Mutex or std::sync::Mutex in async code?

Use std::sync::Mutex (or parking_lot) when the guard does not cross an .await; it is cheaper, and the Tokio docs call the standard mutex often preferred. Reach for tokio::sync::Mutex only when you must hold the lock across an await, typically around an IO resource like a database connection.

What is the difference between spawn_blocking and block_in_place?

spawn_blocking moves a closure to a dedicated blocking thread and returns a JoinHandle you await. block_in_place runs on the current worker thread but tells the scheduler to off-load other tasks first, and it panics on a current_thread runtime.

How many threads does the Tokio runtime use by default?

The multi-threaded runtime defaults to one worker thread per CPU core, plus up to 512 additional blocking threads for spawn_blocking work, each with a 10-second idle timeout.

Why does a single CPU-heavy task slow my whole async service?

A poll that computes without yielding does not return until the computation finishes, so the worker thread cannot poll any other task in the meantime. Move the work to spawn_blocking, or add yield_now points so the scheduler can interleave other work.

How do I see which async tasks are slow?

Run tokio-console: add console-subscriber, call console_subscriber::init(), build with RUSTFLAGS set to --cfg tokio_unstable, then run tokio-console to watch per-task poll times and warnings live.

Research & sources

Primary references reviewed while compiling this guide.

01. std::future::Future - Rust standard library docs (doc.rust-lang.org)
02. tokio::task::spawn_blocking - tokio 1.52 docs (docs.rs)
03. tokio::sync::Mutex - tokio 1.52 docs (docs.rs)
04. tokio::task::block_in_place - tokio 1.52 docs (docs.rs)
05. tokio::runtime::Builder - tokio 1.52 docs (docs.rs)
06. tokio::task::yield_now - tokio 1.52 docs (docs.rs)
07. tokio-rs/console - tokio-console diagnostics tool (github.com)
08. futures::stream::StreamExt - futures 0.3 docs (docs.rs)
09. Reducing tail latencies with automatic cooperative task yielding - Tokio blog (tokio.rs)
10. tokio-metrics - runtime and task instrumentation (docs.rs)
11. Clippy lint configuration - future-size-threshold (doc.rust-lang.org)
