Learning DPDK: Eliminating NIC Receive Drops at 100GbE

FastNetMon

June 17, 2026

TL;DR: everything that matters, in one paragraph. A DPDK poll-mode application can lose packets at 100GbE even when the CPU has spare cycles, because a hardware IRQ landing on a busy-poll worker core stalls it for a few microseconds, and at 100+ Mpps that’s enough to overflow the NIC’s RX descriptor ring before the worker can drain it. The fix is isolation, not more CPU. Pinning the NIC’s completion IRQs off the worker cores (onto a housekeeping core) cut the receive-miss rate from 1 in 2,275 packets to 1 in 16,000,000, about a 7,000× reduction. To take the residual to essentially zero, isolate the worker cores from the kernel entirely with boot parameters: isolcpus, nohz_full, rcu_nocbs, irqaffinity=0, and processor.max_cstate=1. None of this costs throughput; it just stops anything from interrupting the cores that poll the NIC.

A DPDK worker is a tight loop that does nothing but poll the NIC and process packets. It has no slack: if the OS steals it for even a few microseconds, the NIC keeps filling the RX ring with no one draining it. At low rates that’s invisible; at 100GbE the ring overflows and you get silent drops.

Figure 1: By default a NIC completion IRQ can land on a busy-poll worker core, preempting it for microseconds — long enough to overflow the RX ring during a burst, so rx_missed rises.
Figure 2: Pin the NIC IRQs to a housekeeping core (core 0) and the workers poll uninterrupted — rx_missed drops to ≈ 0.

The Symptom: rx_missed Under Load

The tell is the NIC’s rx_missed (a.k.a. rx_missed_errors / PHY discards) counter rising under load while CPU utilization shows headroom. The packets never reach the application — the NIC dropped them because the RX descriptor ring was full when they arrived. More workers won’t help; the cores aren’t saturated. Something is interrupting them.
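One way to confirm the symptom is to sample a miss counter twice and look at the growth. The sketch below is a minimal helper, not part of the original setup: `counter_delta` is a made-up name, and it reads any file that holds a single integer. Pointing it at the kernel's per-interface statistic is an assumption; whether drops on DPDK-owned queues show up there depends on the driver, and the application's own port statistics may be the better source.

```shell
# Print how much a counter grew over an interval.
# $1 = path to a file holding one integer, $2 = seconds to wait (default 1)
counter_delta() {
  local path=$1 interval=${2:-1}
  local before after
  before=$(cat "$path")
  sleep "$interval"
  after=$(cat "$path")
  echo $(( after - before ))
}
```

For example, `counter_delta /sys/class/net/eth0/statistics/rx_missed_errors 5` prints the misses recorded in five seconds, where `eth0` is a placeholder interface name. A non-zero result alongside idle CPU headroom matches the pattern described above.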

The Cause: Hardware IRQs on Poll Cores

Even with a bifurcated or poll-mode driver, the NIC still raises completion/async interrupts, and by default the kernel is free to deliver them to any core — including the ones running your DPDK workers. On the box behind these numbers, the mlx5 completion IRQs defaulted onto cores 11–13, right on top of busy-poll workers (Figure 1).

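Each IRQ preempts the poll loop for only a few microseconds, but do the math on how little slack there is: at ~30 Mpps per queue, every 100 µs of stall is **~3,000 packets** of backlog, and even a completely empty 8,192-entry RX ring absorbs only about 273 µs of traffic (8,192 / 30 Mpps). During a burst the ring is already partly full, so one stray interrupt can be enough to tip it into overflow.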
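The budget can be checked with shell integer arithmetic. This is a small sketch using the figures quoted here (30 Mpps per queue, a 100 µs stall, an 8,192-entry ring); the function names are invented for illustration, and division rounds down.

```shell
# Packets that arrive on one queue while its worker is stalled.
# 1 Mpps is exactly 1 packet per microsecond, so the product is a packet count.
packets_during_stall() {
  local mpps=$1 stall_us=$2
  echo $(( mpps * stall_us ))
}

# Microseconds of traffic an empty RX ring can absorb (rounded down).
ring_fill_us() {
  local ring_entries=$1 mpps=$2
  echo $(( ring_entries / mpps ))
}

packets_during_stall 30 100    # prints 3000
ring_fill_us 8192 30           # prints 273
```

Plugging in your own per-queue rate and ring size shows how long a worker can be away before the NIC starts dropping.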

Pinning IRQs Off the Workers

The first and biggest win is to keep NIC IRQs away from worker cores. Steer every device IRQ to a housekeeping core (core 0) by writing its smp_affinity:

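    # Send every mlx5 IRQ to core 0 (mask 0x1).
    # IRQ numbers are the first column of /proc/interrupts; the driver name
    # appears at the end of each row.
    for irq in $(awk '/mlx5/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
      # Some IRQs cannot be moved; ignore those write errors
      echo 1 > "/proc/irq/$irq/smp_affinity" 2>/dev/null
    done
    # and stop irqbalance from moving them back
    systemctl stop irqbalance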
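Before and after pinning, it helps to see which IRQs belong to the NIC and which cores each one is allowed to fire on. The following is a minimal sketch, assuming the driver name (mlx5 on this box) appears in the /proc/interrupts rows; `irqs_for_driver` and `show_irq_affinity` are hypothetical helper names, and the optional file argument exists only so the parser can be tried on a saved copy.

```shell
# Print the IRQ numbers whose /proc/interrupts row mentions the given driver.
# $1 = driver pattern (e.g. mlx5), $2 = file to parse (default /proc/interrupts)
irqs_for_driver() {
  local pattern=$1 file=${2:-/proc/interrupts}
  awk -v pat="$pattern" '$0 ~ pat { sub(":", "", $1); print $1 }' "$file"
}

# Print "IRQ <n> -> cores <list>" for every matching IRQ on the live system.
show_irq_affinity() {
  local irq
  for irq in $(irqs_for_driver "$1"); do
    echo "IRQ $irq -> cores $(cat "/proc/irq/$irq/smp_affinity_list")"
  done
}
```

Running `show_irq_affinity mlx5` after the pin should list only the housekeeping core. Any row that still names a worker core is an IRQ that did not move.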

The effect on this hardware:

| State | rx-miss rate | Drop rate |
|---|---|---|
| Before IRQ pin | 1 / 2,275 pkts | 0.044 % |
| After IRQ pin | 1 / 16,000,000 pkts | 0.0000062 % |

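That’s a ~7,000× reduction from one change. irqbalance must stay stopped, though, or it will quietly move the IRQs back onto the workers on its next rebalancing pass; systemctl stop lasts only until the next reboot, so use systemctl disable --now irqbalance to keep it off permanently.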
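The ratios above come from two counter deltas taken over the same interval: total packets and missed packets. Here is a small sketch with hypothetical helper names; the sample inputs are illustrative values chosen to reproduce the published ratios, not raw measurements from the test.

```shell
# "1 in N": how many packets arrive per missed packet (rounded down).
one_in_n() {
  local total=$1 missed=$2
  if [ "$missed" -eq 0 ]; then
    echo "no misses"
    return
  fi
  echo "1 in $(( total / missed ))"
}

# Improvement factor between two "1 in N" rates (rounded down).
reduction_factor() {
  local before_n=$1 after_n=$2
  echo $(( after_n / before_n ))
}

one_in_n 22750000 10000          # prints: 1 in 2275
reduction_factor 2275 16000000   # prints: 7032
```

The exact factor is just over 7,000, which is where the rounded "~7,000×" comes from.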

Full Isolation: isolcpus, nohz_full, rcu_nocbs

To take the residual misses to essentially zero, remove the worker cores from the kernel’s reach at boot. Append to the kernel command line (/etc/default/grub, then update-grub and reboot):

isolcpus=1-32 nohz_full=1-32 rcu_nocbs=1-32 irqaffinity=0 processor.max_cstate=1

Each one closes a different interruption source:

  • isolcpus=1-32: keep the scheduler from placing any other task on the worker cores.
  • nohz_full=1-32: stop the periodic scheduler tick on those cores while a single task is running there (at a 1 kHz tick, that removes a per-millisecond interrupt).
  • rcu_nocbs=1-32: move RCU callback processing off the worker cores onto housekeeping cores.
  • irqaffinity=0: default all IRQs to core 0 from boot, before userspace even starts.
  • processor.max_cstate=1: forbid deep C-states, so a core never takes 50–200 µs to wake from idle.

Use the same core range as your DPDK workers. Together these leave the poll loop as very nearly the only thing that ever runs on a worker core, which is the entire point.
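After the reboot it is worth confirming that the parameters actually reached the kernel, since a typo in the GRUB file fails silently. The sketch below assumes the five parameters listed above; `missing_isolation_params` is a hypothetical helper name, and it accepts a command line as an argument so it can be tried without rebooting.

```shell
# Print each isolation parameter that is absent from a kernel command line.
# $1 = command line to inspect (default: the running kernel's /proc/cmdline)
missing_isolation_params() {
  local cmdline=${1:-$(cat /proc/cmdline)}
  local name
  for name in isolcpus nohz_full rcu_nocbs irqaffinity processor.max_cstate; do
    case " $cmdline " in
      *" $name="*) ;;          # present, nothing to report
      *) echo "$name" ;;       # missing from the command line
    esac
  done
}
```

Called with no argument, it inspects the running kernel and prints nothing when all five are set. The kernel also reports the ranges it applied in /sys/devices/system/cpu/isolated and /sys/devices/system/cpu/nohz_full, which should match the DPDK worker cores.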

Summary

  1. NIC drops at 100GbE are usually interruption, not CPU saturation — rx_missed rises with cores to spare.
  2. A hardware IRQ stalls a busy-poll worker for microseconds; at ~30 Mpps per queue that can be enough to overflow the RX ring during a burst.
  3. Pin NIC IRQs to a housekeeping core (smp_affinity) and stop irqbalance — here a ~7,000× drop reduction.
  4. Isolate the worker cores at boot: isolcpus, nohz_full, rcu_nocbs, irqaffinity=0, processor.max_cstate=1.
  5. None of it costs throughput — it removes everything that competes with the poll loop.
