Last Updated: 2018-09-12 19:47:11 UTC
by Johannes Ullrich (Version: 1)
)[Disclaimer: This article deals with legacy IPv4 networks. IPV6 has cleaned up some of the fragmentation issues, and it looks like IPv4 is backporting some of these changes]
IP fragmentation has always been a tricky issue. Many operating systems had issues implementing it and RFCs have often been ignored (for more or less good reasons). Over the last years, techniques like "Path MTU Discovery" have become popular to mostly eliminate the need for fragmentation, in particular with IPv6 making it mandatory.
So first a quick primer on the how and why of fragmentation in IPv4.
We need fragmentation mostly because not all networks use the same MTU. IPv4 requires that hosts are able to process at least 68 bytes, which is just enough for the maximum IP header size and a couple bytes of protocol header [RFC 791]. Ethernet typically uses an MTU of 1500 bytes but can go all the way up to 9198 bytes. So in short, MTUs are "all over" and there is no guarantee as to what MTU you will find on networks forwarding your packet.
Path MTU Discovery solved this problem. TCP packets are now sent with the "Do Not Fragment" flag set, and routers will report back if the packet is too large (I hope you are allowing ICMP errors back in your network to support this).
Problems with fragments:
- They may arrive out of order. So recipients need to buffer them (for how long? The IPv4 RFC doesn't say..)
- They could overlap (the RFC suggests that hosts take the first arriving copy in this case, but not all operating systems have done this in the past)
- buffering fragments requires resources
One issue that highlighted these problems recently was labeled "Fragmentsmack". Reassembling lots of fragments arriving in various orders can overwhelm some of the reassembly algorithms, and as a result, cause a DoS condition. This issue appears to have affected Linux and Windows. Linux advised using a smaller memory buffer for fragments to fix this issue. Microsoft yesterday' suggested in its patch Tuesday note to drop all out of order fragments via a registry fix.
For Linux, a patch was submitted in response that would drop all overlapping fragments.
So this got me to think about how much fragments there are in modern networks. I hadn't checked in a while, but my overall guess was "not much", and that has been shown true so far. I collected all fragments for a day from a couple of networks, and also checked with others who looked into this issue.
The number one source of fragments appears to be DNS. In particular, if you are using DNSSEC, your DNS server will support a feature called "EDNS0" (Extended DNS Option 0, RFC 2671). Historically, DNS limited UDP responses to 512 bytes (RFC 1035). DNSSEC often requires larger responses for keys and such, so EDNS0 allows a client to signal to a server that it is willing to receive larger replies. This is more efficient than using truncated UDP replies and than following up with a TCP request. RFC 2671 suggests 1280 bytes as "reasonable" for an ethernet connected client, but doesn't mandate a particular size. I often see 4096 bytes used. These larger responses lead to fragmentation.
I have also heard of issues with SIP and fragmentation, but haven't been able to observe this first hand.
So what does this all mean for your networks? I wrote a quick scapy script to see how different current (fully patched) operating systems dealt with fragments. I looked at three cases:
- normal fragments (in order and no overlap)
- out of order fragments (I sent the first fragment last)
- overlapping fragments (just two fragments)
You can find the script here: https://github.com/jullrich/ipv6/blob/master/scapy/fragtest.py
I ran it against PFSense (FreeBSD 11.1-RELEASE-p10), Linux 3.10 (CentOS 7), Linux 4.15 (Ubuntu 18.04), OS X 10.11.6 and MacOS 10.13.6 and all of them responded to all three cases.
Window 10 is so far the only operating system that did not respond to overlapping fragments at all. There is always a chance that I got the script wrong. But I even tried it with an identical payload. The overlapping payload in the script was selected so it would generate identical checksums, no matter if the first or second copy is selected.
If you have any insight or are able to run the script against other systems, please let me know what you find. In all cases, I disabled the host-based firewall.