Learning VPP: Filtering Packets at 100GbE Line Rate

FastNetMon



July 2, 2026


This post is a repost of a technical blog post originally published by Denys Haryachyy, shared here with permission as part of ongoing research and engineering work around FastNetMon’s inline traffic processing capabilities.

TL;DR. A VPP software data plane classifies and drops packets using a tuple-space search (TSS): rules are grouped by mask shape into per-mask bihash tables, and each packet probes one table per distinct mask. Matching cost is O(number of distinct masks), independent of how many rules you load. VPP’s in-tree ACL plugin solves the same problem with the TupleMerge variant (one shared bihash, relaxed and merged masks); our filter uses plain TSS (one small exact-mask hash per shape). Same idea, opposite trade-offs.

Dropping a packet sounds free, but the CPU still has to receive it, classify it, decide, and free the buffer. This article is about the classifier that makes the rule-matching step effectively free — no matter how many rules you load.

Figure 1: Tuple-space search — each packet probes one bihash per distinct mask shape (O(#masks)), so matching cost is independent of the rule count.

Tuple-Space Search: One Hash per Mask Shape

Tuple-space search (Figure 1) avoids walking the rule list at all. A rule’s mask shape is the combination of fields it constrains — source/destination prefix lengths, protocol, port ranges. Rules that share a mask shape can be matched by a single exact-match hash lookup on the masked packet fields. So TSS:

Groups rules by mask shape into a handful of per-mask hash tables — each one a VPP clib_bihash_24_8 (a bounded-index, lock-free hash), keyed on the packet’s 7-tuple fields masked to that shape.
Probes one table per distinct mask for each packet. The cost is the number of distinct masks, not the number of rules, and that number is small: real-world ACLs have tens of mask shapes even with millions of rules. Among the tables that return a hit, the lowest-order rule gives the action; if no table matches, the packet passes.

IPv6 rules use the same TSS with a wider key — a per-mask clib_bihash_40_8 (40-byte key: dst + src IPv6 + protocol). Both the IPv4 and IPv6 paths are pure TSS; only the key width differs (24 B vs 40 B).

The consequence is the headline of this whole article: classification cost scales with the number of distinct mask shapes, not with how many rules you load. Load 100 rules or 1M; if they collapse into the same handful of masks, the per-packet cost is identical.
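The grouping-and-probing idea can be sketched in a few lines of Python. This is a toy model, not the filter's code: plain dicts stand in for the per-mask bihash tables, the shape is reduced to (dst prefix length, src prefix length, protocol constrained or not), and the names `TupleSpace`, `mask_ip`, `add_rule` and `classify` are made up for illustration.

```python
import ipaddress


def mask_ip(ip: str, prefix_len: int) -> int:
    """Keep only the top prefix_len bits of an IPv4 address."""
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return int(ipaddress.IPv4Address(ip)) & mask


class TupleSpace:
    """Toy plain TSS: one dict per mask shape (dst_len, src_len, proto constrained?)."""

    def __init__(self):
        self.tables = {}      # shape -> {masked key -> (rule order, action)}
        self.next_order = 0   # lower order = higher priority

    def add_rule(self, dst, dst_len, src="0.0.0.0", src_len=0, proto=None, action="drop"):
        shape = (dst_len, src_len, proto is not None)
        key = (mask_ip(dst, dst_len), mask_ip(src, src_len), proto)
        # setdefault keeps the earliest rule if two rules share a masked key
        self.tables.setdefault(shape, {}).setdefault(key, (self.next_order, action))
        self.next_order += 1

    def classify(self, dst, src, proto):
        """Return (action, probes); probes equals the number of mask shapes."""
        best, probes = None, 0
        for (dst_len, src_len, use_proto), table in self.tables.items():
            probes += 1
            key = (mask_ip(dst, dst_len), mask_ip(src, src_len), proto if use_proto else None)
            hit = table.get(key)
            if hit is not None and (best is None or hit[0] < best[0]):
                best = hit
        return (best[1] if best else "pass"), probes


ts = TupleSpace()
for i in range(10_000):                                   # 10,000 rules, one shape: dst/24
    ts.add_rule(f"10.{i // 256}.{i % 256}.0", 24)
ts.add_rule("192.168.1.0", 24, "8.8.0.0", 16, proto=17)   # one more rule, second shape
print(ts.classify("192.168.1.55", "8.8.8.8", 17))         # two shapes, so two probes
```

Loading ten thousand or ten million dst/24 rules leaves the probe count at two in this model; only adding a rule with a new shape adds a probe.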

Figure 2: A worked tuple-space-search example. Three rules collapse into three mask shapes, each with its own bihash. The packet (dst 192.168.1.55, src 8.8.8.8) is masked once per shape (dst/8, dst/24, dst/24+src/16) and each masked key probes its table. Mask B hits rule R2 → DROP. The cost is exactly three probes (one per mask), no matter how many rules each table holds.

VPP’s Own ACL Plugin: TSS, but TupleMerge

VPP already ships a tuple-space search — in the in-tree ACL plugin — and it’s worth comparing, because it makes the opposite set of trade-offs. Both group rules by mask shape and probe hash tables instead of scanning a rule list. But the ACL plugin uses the TupleMerge variant (Daly & Torng, ICCCN 2017) and folds every table into a single shared hash, where our filter stays plain TSS with one small hash per mask.

Figure 3: Two takes on TSS. VPP’s ACL plugin (left) merges masks into one shared bihash_48_8, keyed by the full 5-tuple plus a mask-type index, so every hit is a candidate to re-verify. Our filter (right) keeps one small bihash_24_8 per mask shape with the rule index stored inline in the value.

VPP’s ACL plugin. TupleMerge relaxes and merges compatible masks into fewer tables, so a rule can land in a table whose mask omits some of the bits it actually constrains. That bounds the table count — it splits a table once it collects more than 39 colliding rules — but it means a hash hit is only a candidate: the matched rule is re-verified (port ranges and the relaxed-away bits) before it counts. Every logical table lives in one shared clib_bihash_48_8; the 48-byte key is the full 5-tuple plus a mask_type_index and a lookup-context index, so a single physical table holds every per-mask table and every interface’s context at once. To honor ACE order it probes every mask type and keeps the lowest applied-entry index — it can’t stop at the first hit.

Figure 4: The same three rules and packet as Figure 2, now under TupleMerge. Relaxing masks merges them into fewer mask-types in one shared bihash, so the packet takes fewer probes (2 vs 3). But because bits were relaxed away, a hit is only a candidate: the matched rule (R2) is re-verified against the omitted bits before it counts. In plain TSS (Figure 2) the exact-mask hit needs no such re-check.
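A small sketch shows why a relaxed-mask hit is only a candidate. It is not the ACL plugin's code: the rule values, the relaxed length of /16 and the names `masked`, `in_prefix` and `lookup` are assumptions chosen to mirror the R2 example, with a dict standing in for the shared bihash.

```python
import ipaddress


def masked(ip: str, prefix_len: int) -> int:
    """Keep only the top prefix_len bits of an IPv4 address."""
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return int(ipaddress.IPv4Address(ip)) & mask


def in_prefix(ip: str, prefix: str, prefix_len: int) -> bool:
    return masked(ip, prefix_len) == masked(prefix, prefix_len)


# Hypothetical rule R2 constrains dst/24, but it is stored under a relaxed dst/16 mask.
RELAXED_LEN = 16
rules = {"R2": {"dst": "192.168.1.0", "dst_len": 24, "action": "drop"}}

table = {}
for name, rule in rules.items():
    table.setdefault(masked(rule["dst"], RELAXED_LEN), []).append(name)


def lookup(dst: str):
    """Return (action, candidates examined)."""
    candidates = table.get(masked(dst, RELAXED_LEN), [])
    for name in candidates:
        rule = rules[name]
        # Re-verify the bits the relaxed mask dropped (/17 to /24 here).
        if in_prefix(dst, rule["dst"], rule["dst_len"]):
            return rule["action"], len(candidates)
    return "pass", len(candidates)


print(lookup("192.168.1.55"))   # hash hit, verification succeeds
print(lookup("192.168.2.55"))   # hash hit, verification fails
```

The second lookup lands in the same bucket as R2 because both addresses share 192.168.0.0/16, yet the packet is outside R2's /24. Without the re-check it would be dropped wrongly.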

Our filter. Ours is plain TSS: one clib_bihash_24_8 per distinct mask shape, no merging and no relaxation. The 24-byte key is just the masked destination and source IPv4 plus protocol, and the value is the rule index inlined directly — a collision chain appears only when two rules share a masked key. Because masks are exact, the address-and-protocol match from the hash is authoritative; we only re-check the fields that aren’t in the key (ports, DSCP, fragment flags, length). Like VPP we probe every tuple and keep the lowest rule order, but a sorted collision chain lets us prune as soon as we pass the best match. IPv6 rules use a separate set of per-mask clib_bihash_40_8 tables — a 40-byte key holding the full source and destination IPv6 addresses plus protocol — so IPv6 is the same TSS, just a wider key.
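The lowest-order selection with chain pruning might look like the sketch below. It assumes each collision chain is a list of (rule order, port range, action) tuples and that the destination port is the only field re-checked; the real chain layout is not described in the post, and `add_to_chain`, `walk_chain` and `best_match` are hypothetical names.

```python
from bisect import insort


def add_to_chain(chain, order, port_lo, port_hi, action):
    """Insert while keeping the chain sorted by rule order (tuples compare on order first)."""
    insort(chain, (order, port_lo, port_hi, action))


def walk_chain(chain, dst_port, best_order=None):
    """Return (order, action) of the first entry that beats best_order and
    passes the port re-check, or None."""
    for order, port_lo, port_hi, action in chain:
        if best_order is not None and order >= best_order:
            break                             # sorted chain: nothing further can win
        if port_lo <= dst_port <= port_hi:    # re-check a field that is not in the key
            return order, action
    return None


def best_match(chains, dst_port):
    """chains: the collision chains returned by each per-mask probe that hit."""
    best = None
    for chain in chains:
        hit = walk_chain(chain, dst_port, best[0] if best else None)
        if hit is not None:
            best = hit
    return best


chain_a, chain_b = [], []
add_to_chain(chain_a, 5, 0, 65535, "drop")
add_to_chain(chain_b, 7, 0, 65535, "pass")
add_to_chain(chain_b, 2, 53, 53, "drop")
print(best_match([chain_a, chain_b], 80))   # order 5 wins; order 7 is pruned
print(best_match([chain_a, chain_b], 53))   # order 2 wins
```

Pruning never changes the result, since a skipped entry could not have beaten the current best; it only shortens the walk.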

Aspect	VPP ACL plugin	Our filter
Algorithm	TupleMerge (TSS + mask relax/merge)	Plain TSS (one table per mask)
Hash table	one shared clib_bihash_48_8	one clib_bihash_24_8 (IPv4) / clib_bihash_40_8 (IPv6) per mask
Key	48 B: 5-tuple + mask_type_index + lc_index	24 B (IPv4) / 40 B (IPv6): masked dst + src + proto
Value	applied-entry index → ACE	rule index inline, or collision-chain index
Hash hit	candidate; re-verify relaxed bits + ports	addr+proto exact; re-verify ports/DSCP/frag only
Table count	bounded (merge + split at 39 collisions)	one per distinct mask shape
Priority	scan all tables, keep lowest ACE index	probe all tuples, keep lowest rule order
IPv6	same 48 B key (is_ip6 bit)	separate per-mask clib_bihash_40_8 (TSS)

Why Plain TSS for a DDoS Filter

TupleMerge is the right tool for a general-purpose ACL — thousands of rules across hundreds of distinct mask shapes, where the number of tables would otherwise explode. A DDoS drop filter is the opposite workload, and that flips the trade-offs:

Few, fixed mask shapes — nothing to merge. Mitigation rules are coarse and uniform (drop by victim prefix, by protocol, by amplification port), so they collapse into a handful of shapes. TupleMerge’s table-bounding machinery buys nothing when there are only a couple of shapes to begin with.
You match constantly, so re-verification is the wrong cost. Under a flood the filter matches (drops) most packets, so re-verifying every hit (Figure 4) is paid on nearly every dropped packet. Plain TSS uses exact masks, so an address-and-protocol hit is authoritative — only the fields not in the key (ports, DSCP, fragment) need a cheap recheck.
No table mutation under load. TupleMerge splits and re-homes tables as rules collide. A mitigation system adds and removes rules live as attacks are detected, so a simple insert into the right exact-mask hash, with no relaxation and no splitting, gives predictable behaviour and fast rule churn exactly when you are under attack.
The mask shapes are yours, not the attacker’s. TSS classifiers can be attacked by inducing many distinct tuples; in a purpose-built filter the shapes are fixed by your rule set, not by traffic, so the probe count is bounded by design and there is no explosion to tame.

Load an arbitrary ACL with hundreds of distinct masks and the trade inverts — TupleMerge’s bounded table count would win. But that is not the DDoS case: here the shapes are few and fixed, so plain TSS keeps authoritative exact-mask hits, a smaller key, and a simple lock-free insert path — the same algorithm with the general-purpose tax removed.

Cost vs. Number of Mask Shapes (Measured)

Since the cost driver is the number of distinct mask shapes, not the rule count, we measured it directly. We load ~1M rules across N distinct shapes (one hot shape that the flood matches, so those packets are dropped, plus N−1 cold ones), then sample the classifier’s cyc/pkt and the DUT throughput under an RSS-optimal line-rate flood. Each per-mask bihash is sized with tss-bihash-buckets = 1048576 (2^20 buckets, ≈ 1 entry/bucket for ~1M rules). The test machine is an AMD EPYC 7742 with 32 workers. The shapes and the results follow.

A mask shape is the set of fields a rule constrains, with their prefix lengths — one shape = one tuple = one probe. The sweep mints N distinct shapes; a few of the ones it uses:

udp, dst-port 8000–8255 — the hot shape the flood matches → dropped
dst /24 — a subnet (the bulk: ~1M cold rules all share this one shape)
dst /32 — a single host
dst /24 + udp
dst /24 + udp + dst-port
dst /24 + src /24
dst /24 + src /32 + udp

The loader also varies the prefix length (/20, /22, /24, /25, /26, /28, /30, /32) to manufacture more distinct shapes up to N. Two rules share a shape only if they constrain the same fields at the same prefix lengths — so a million dst /24 rules are one shape, while dst /24 and dst /32 are two.
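The shape rule is easy to state in code. In this minimal sketch a rule is a plain dict and `mask_shape` is a hypothetical helper that keeps only which fields are constrained and at which prefix lengths, ignoring the values; the field names are illustrative rather than the loader's.

```python
from collections import Counter


def mask_shape(rule: dict) -> tuple:
    """Shape = which fields are constrained and at which prefix lengths; values are ignored."""
    return (
        rule.get("dst_len", 0),
        rule.get("src_len", 0),
        "proto" in rule,
        "dst_port" in rule,
    )


# 1,000 different dst/24 subnets, one dst/32 host, one dst/24 + udp rule.
rules = [{"dst": f"10.{(i >> 8) & 255}.{i & 255}.0", "dst_len": 24} for i in range(1000)]
rules.append({"dst": "192.0.2.7", "dst_len": 32})
rules.append({"dst": "192.0.2.0", "dst_len": 24, "proto": 17})

shapes = Counter(mask_shape(r) for r in rules)
print(len(rules), "rules ->", len(shapes), "shapes")
```

All thousand subnet rules land in one shape, while the /32 host and the dst/24 + udp rule each add their own, so this set costs three probes per packet.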

Distinct mask shapes (probes/pkt)	filter cyc/pkt	DUT Mpps
1	77	142.1
2	94	142.1
4	122	142.1
8	171	140.9
16	290	138.2
32	538	102.9
64	1096	56.8

cyc/pkt grows ~linearly with shape count — cost is O(distinct mask shapes). Throughput holds 100GbE line rate (142 Mpps) through ~8 shapes and stays near it to 16, then falls as the per-packet cost exceeds the worker’s cycle budget. A DDoS rule set has only a handful of shapes, so it sits in the flat 142-Mpps regime no matter how many rules you load; a general ACL with tens of shapes is where the climb begins — TupleMerge’s territory.
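As a rough check on the linear trend, an ordinary least-squares line can be fitted to the table's two columns. The numbers are copied from the table above; the fit is an illustration, not part of the original measurement.

```python
shapes = [1, 2, 4, 8, 16, 32, 64]
cycles = [77, 94, 122, 171, 290, 538, 1096]   # filter cyc/pkt from the table

n = len(shapes)
sx, sy = sum(shapes), sum(cycles)
sxx = sum(x * x for x in shapes)
sxy = sum(x * y for x, y in zip(shapes, cycles))

# Closed-form least-squares slope and intercept for y = slope * x + intercept.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(f"~{slope:.1f} cycles per extra shape, ~{intercept:.0f} cycles fixed cost")
```

The fit comes out at roughly 16 cycles per additional shape on top of a fixed cost of about 50 cycles, which is consistent with one extra hash probe per shape.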

We Tried It: Reusing VPP’s ACL Engine

We also tested the obvious shortcut: reusing VPP’s stock ACL plugin (a TupleMerge classifier) as the lookup engine, wired into the same node so only the lookup differs. On identical rules and flood it cost ~29× more cycles per packet and couldn’t hold line rate — TupleMerge’s strengths (one bounded shared table, full stateful-ACL generality) are exactly what a line-rate drop filter doesn’t need.

Engine (5 rules, 1 mask shape)	Cycles / packet	Throughput
Plain TSS (our filter)	82	142 Mpps (line rate)
TupleMerge (VPP ACL plugin)	2,352	~29 Mpps

Measured on 32 worker cores of an AMD EPYC 7742 (~2.25 GHz base) with a 100GbE NIC; the rules and flood are the same, and only the lookup engine differs. That CPU budget is why the numbers land where they do: 32 cores × 2.25 GHz is 72 billion cycles per second, so at 2,352 cyc/pkt the lookup alone caps throughput near 30 Mpps, in line with the ~29 Mpps measured. TSS’s 82 cyc leaves the CPU mostly idle, so the 100GbE NIC, not the filter, sets its 142 Mpps ceiling.
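The budget arithmetic can be reproduced directly. The sketch assumes all 32 workers run at the 2.25 GHz base clock and that the lookup is the only per-packet cost, so the result is an upper bound rather than a prediction; `ceiling_mpps` is a hypothetical helper.

```python
WORKERS = 32
BASE_HZ = 2.25e9   # base clock from the text; boost clocks would raise the ceiling


def ceiling_mpps(cycles_per_packet: float) -> float:
    """Upper bound on throughput if the lookup were the only per-packet cost."""
    return WORKERS * BASE_HZ / cycles_per_packet / 1e6


tss, tuplemerge = 82, 2352   # cycles per packet from the table
print(f"cycle ratio: {tuplemerge / tss:.1f}x")
print(f"TupleMerge ceiling: {ceiling_mpps(tuplemerge):.1f} Mpps")
print(f"TSS ceiling: {ceiling_mpps(tss):.0f} Mpps")
```

The TupleMerge bound lands a little above the measured ~29 Mpps, presumably because the rest of the per-packet path (receive, parse, buffer free) also spends cycles. The TSS bound is several times the 142 Mpps the NIC delivers, which is why the NIC is the limit in that row.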

Summary

Tuple-space search over per-mask bihash tables keeps rule-matching off the hot path: cost is O(#distinct masks), independent of how many rules you load.
A rule’s mask shape picks its table; a packet probes one table per distinct mask and takes the lowest-order match.
VPP’s ACL plugin uses TupleMerge (relaxed masks merged into one shared bihash, hits re-verified) for arbitrary ACLs; our filter uses plain TSS (one exact-mask bihash per shape) — cheaper for the handful of shapes a DDoS filter has.

References

FD.io VPP documentation — the Vector Packet Processor and its node-graph data plane.
VPP bihash — bounded-index extensible hash — the lock-free hash behind tuple-space search.
Tuple Space Search (Srinivasan et al., 1999) — the packet-classification algorithm grouping rules by mask shape.
TupleMerge (Daly et al., IEEE/ACM ToN 2019) — the TSS variant VPP’s ACL plugin uses: it merges compatible masks into fewer tables, trading exact-mask hits for bounded table growth.
The Design and Implementation of Open vSwitch (Pfaff et al., NSDI 2015) — tuple-space search in production: OVS’s megaflow classifier is a TSS (§5).
Intel — OVS-DPDK Datapath Classifier — an accessible walkthrough of the TSS classifier (subtables = tuples, masking, subtable ranking).
