[Discuss] Off-Topic [IP] BufferBloat: What's Wrong with the Internet? (fwd)
Stephen Ronan
sronan at panix.com
Mon Dec 12 16:13:37 EST 2011
Perhaps of interest -s.r.
---------- Forwarded message ----------
Date: Fri, 9 Dec 2011 09:52:45 -0500
From: Dave Farber <dave at farber.net>
To: ip <ip at listbox.com>
Subject: [IP] BufferBloat: What's Wrong with the Internet?
<http://queue.acm.org/detail.cfm?id=2076798>
BufferBloat: What's Wrong with the Internet?
A discussion with Vint Cerf, Van Jacobson, Nick Weaver, and Jim Gettys
Internet delays are now as common as they are maddening. That
means they end up affecting system engineers just like all the
rest of us. And when system engineers get irritated, they often
go looking for what's at the root of the problem. Take Jim
Gettys, for example. His slow home network had repeatedly proved
to be the source of considerable frustration, so he set out to
determine what was wrong, and he even coined a term for what he
found: bufferbloat.
Bufferbloat refers to excess buffering inside a network,
resulting in high latency and reduced throughput. Some buffering
is needed; it provides space to queue packets waiting for
transmission, thus minimizing data loss. In the past, the high
cost of memory kept buffers fairly small, so they filled quickly
and packets began to drop shortly after the link became
saturated, signaling to the communications protocol the presence
of congestion and thus the need for compensating adjustments.
Because memory now is significantly cheaper than it used to be,
buffering has been overdone in all manner of network devices,
without consideration for the consequences. Manufacturers have
reflexively acted to prevent any and all packet loss and, by
doing so, have inadvertently defeated a critical TCP
congestion-detection mechanism, with the result being worsened
congestion and increased latency.
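The mechanism being defeated is TCP's loss-driven congestion control. A toy sketch in Python (illustrative numbers only, not real TCP) shows the idea: timely drops keep the sender's window bounded, while a bloated buffer that never drops leaves the window, and the standing queue behind the bottleneck, growing unchecked.

    # Toy AIMD sketch (illustrative only, not real TCP): the sender grows its
    # window each round and halves it whenever a loss is signaled. If an
    # oversized buffer absorbs the overload instead of dropping, the loss
    # signal never arrives and the queue keeps growing.
    def aimd(rounds, loss_rounds):
        cwnd, history = 1.0, []
        for r in range(rounds):
            if r in loss_rounds:
                cwnd = max(1.0, cwnd / 2.0)   # multiplicative decrease on loss
            else:
                cwnd += 1.0                   # additive increase per round trip
            history.append(cwnd)
        return history

    print(aimd(20, loss_rounds={8, 15}))      # timely drops keep the window bounded
    print(aimd(20, loss_rounds=set()))        # no loss signal: unbounded growth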
Now that the problem has been diagnosed, people are working
feverishly to fix it. This case study considers the extent of the
bufferbloat problem and its potential implications. Working to
steer the discussion is Vint Cerf, popularly known as one of the
"fathers of the Internet." As the co-designer of the TCP/IP
protocols, Cerf did indeed play a key role in developing the
Internet and related packet data and security technologies while
at Stanford University from 1972-1976 and with DARPA (the U.S.
Department of Defense's Advanced Research Projects Agency) from
1976-1982. He currently serves as Google's chief Internet
evangelist.
Van Jacobson, presently a research fellow at PARC where he leads
the networking research program, is also central to this
discussion. Considered one of the world's leading authorities on
TCP, he helped develop the RED (random early detection) queue
management algorithm that has been widely credited with allowing
the Internet to grow and meet ever-increasing throughput demands
over the years. Prior to joining PARC, Jacobson was a chief
scientist at Cisco Systems and later at Packet Design Networks.
Also participating is Nick Weaver, a researcher at ICSI
(International Computer Science Institute) in Berkeley, where he
was part of the team that developed Netalyzr, a tool that
analyzes network connections and has been instrumental in
detecting bufferbloat and measuring its impact across the
Internet.
Rounding out the discussion is Gettys, who edited the HTTP/1.1
specification and was a co-designer of the X Window System. He
now is a member of the technical staff at Alcatel-Lucent Bell
Labs, where he focuses on systems design and engineering,
protocol design, and free software development.
VINT CERF What caused you to do the analysis that led you to
conclude you had problems with your home network related to
buffers in intermediate devices?
JIM GETTYS I was running some bandwidth tests on an old IPsec
(Internet Protocol Security)-like device that belongs to Bell
Labs and observed latencies of as much as 1.2 seconds whenever
the device was running as fast as it could. That didn't entirely
surprise me, but then I happened to run the same test without the
IPsec box in the way, and I ended up with the same result. With
1.2-second latency accompanied by horrible jitter, my home
network obviously needed some help. The rule of thumb for good
telephony is 150-millisecond latency at most, and my network had
nearly 10 times that much.
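The arithmetic behind such numbers is straightforward: a full buffer adds delay equal to its size divided by the rate of the link draining it. A rough sketch with assumed figures (a 256 KB buffer on a 2-megabit uplink; neither number comes from the interview):

    # Back-of-the-envelope queueing delay (assumed figures, for illustration)
    buffer_bytes = 256 * 1024            # hypothetical device buffer
    uplink_bytes_per_s = 2_000_000 / 8   # hypothetical 2 Mbit/s uplink
    delay_s = buffer_bytes / uplink_bytes_per_s
    print(f"{delay_s:.2f} s of added delay")   # ~1.05 s, vs. a ~0.15 s telephony budget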
My first thought was that the problem might relate to a feature
called PowerBoost that comes as part of my home service from
Comcast. That led me to drop a note to Rich Woundy at Comcast
since his name appears on the Internet draft for that feature. He
lives in the next town over from me, so we arranged to get
together for lunch. During that lunch, Rich provided me with
several pieces to the puzzle. To begin with, he suggested my
problem might have to do with the excessive buffering in a device
in my path rather than with the PowerBoost feature. He also
pointed out that ICSI has a great tool called Netalyzr that helps
you figure out what your buffering is. Also, much to my surprise,
he said a number of ISPs had told him they were running without
any queue management whatsoever; that is, they weren't running RED
on any of their routers or edge devices.
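For readers unfamiliar with it, RED drops or marks a small fraction of packets before the queue is completely full so that TCP senders back off early. A heavily simplified sketch of the core decision, leaving out refinements such as the count-based spacing of drops:

    # Simplified RED sketch (omits several refinements of the real algorithm)
    import random

    MIN_TH, MAX_TH, MAX_P, W_Q = 5.0, 15.0, 0.1, 0.002   # illustrative parameters

    avg_queue = 0.0   # exponentially weighted average of the queue length

    def red_should_drop(queue_len):
        global avg_queue
        avg_queue = (1 - W_Q) * avg_queue + W_Q * queue_len  # low-pass filter
        if avg_queue < MIN_TH:
            return False                                     # short queue: never drop
        if avg_queue >= MAX_TH:
            return True                                      # long queue: always drop
        # in between: drop probability rises linearly toward MAX_P
        p = MAX_P * (avg_queue - MIN_TH) / (MAX_TH - MIN_TH)
        return random.random() < p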
The very next day I managed to get a wonderful trace. I had been
having trouble reproducing the problem I'd experienced earlier,
but since I was using a more recent cable modem this time around,
I had a trivial one-line command for reproducing the problem. The
resulting SmokePing plot clearly showed the severity of the
problem, and that motivated me to take a packet-capture so I
could see just what in the world was going on. About a week
later, I saw basically the same signature on a Verizon FiOS [a
bundled home communications service operating over a fiber
network], and that surprised me. Anyway, it became clear that
what I'd been experiencing on my home network wasn't unique to
cable modems.
VC I assume you weren't the only one making noises about these
sorts of problems?
JG I'd been hearing similar complaints all along. In fact, Dave
Reed [Internet network architect, now with SAP Labs] about a year
earlier had reported problems in 3G networks that also appeared
to be caused by excessive buffering. He was ultimately ignored
when he publicized his concerns, but I've since been able to
confirm that Dave was right. In his case, he would see daily high
latency without much packet loss during the day, and then the
latency would fall back down again at night as flow on the
overall network dropped.
Dave Clark [Internet network architect, currently senior research
scientist at MIT] had noticed that the DSLAM (Digital Subscriber
Line Access Multiplexer) his micro-ISP runs had way too much
buffering, leading to as much as six seconds of latency. And this
is something he'd observed six years earlier, which is what had
led him to warn Rich Woundy of the possible problem.
VC Perhaps there's an important life lesson here suggesting you
may not want to simply throw away outliers on the grounds they're
probably just flukes. When outliers show up, it might be a good
idea to find out why.
NICK WEAVER But when testing for this particular problem, the
outliers actually prove to be the good networks.
JG Without Netalyzr, I never would have known for sure whether
what I'd been observing was anything more than just a couple of
flukes. After seeing the Netalyzr data, however, I could see how
widespread the problem really was. I can still remember the day
when I first saw the data for the Internet as a whole plotted
out. That was rather horrifying.
NW It's actually a pretty straightforward test that allowed us to
capture all that data. In putting together Netalyzr at ICSI, we
started out with a design philosophy that one anonymous commenter
later captured very nicely: "This brings new meaning to the
phrase, 'Bang it with a wrench.'" Basically, we just set out to
hammer on everything, except we weren't interested in doing a
bandwidth test since there were plenty of good ones out there
already.
I remembered, however, that Nick McKeown and others had ranted
about how amazingly over-buffered home networks often proved to
be, so buffering seemed like a natural thing to test for. It
turns out that would also give us a bandwidth test as a side
consequence. Thus we developed a pretty simple test. Over just a
10-second period, it sends a packet and then waits for a packet
to return. Then each time it receives a packet back, it sends two
more. It either sends large packets and receives small ones in
return, or it sends small packets and receives large ones. During
the last five seconds of that 10-second period, it just measures
the latency under load in comparison to the latency without load.
It's essentially just a simple way to stress out the network.
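A rough sketch of that style of test follows (this is not the Netalyzr code; it assumes a hypothetical UDP echo service at echo.example.net:9999 and glosses over the asymmetric packet sizes):

    # Hedged sketch of a latency-under-load probe, not the Netalyzr implementation
    import socket, struct, time

    SERVER = ("echo.example.net", 9999)  # hypothetical UDP echo endpoint (assumption)
    DURATION = 10.0                      # total test length, as described above
    PAD = b"x" * 1392                    # large packets out, to stress the uplink

    def latency_under_load():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(0.25)
        start = time.time()
        loaded_rtts = []

        def send_probe():
            # embed the send time so the echoed copy lets us compute an RTT
            sock.sendto(struct.pack("!d", time.time()) + PAD, SERVER)

        send_probe()
        while time.time() - start < DURATION:
            try:
                data, _ = sock.recvfrom(2048)
            except socket.timeout:
                send_probe()             # keep probing through packet loss
                continue
            rtt = time.time() - struct.unpack("!d", data[:8])[0]
            if time.time() - start > DURATION / 2:
                loaded_rtts.append(rtt)  # last five seconds: latency under load
            send_probe()                 # for each packet received, send two more
            send_probe()
        return loaded_rtts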
We didn't get around to analyzing all that data until a few
months after releasing the tool. Then what we saw were these very
pretty graphs that gave us reasonable confidence that a huge
fraction of the networks we had just tested could not possibly
exhibit good behavior under load. That was a very scary
discovery.
JG Horrifying, I think.
NW It wasn't quite so horrifying for me because I'd already
effectively taken steps to mitigate the problem on my own
network; namely, I'd paid for a higher class of service on my home
network specifically to get better behavior under load. You can
do that because the buffers are all sized in bytes. So if you pay
for the 4x bandwidth service, your buffer will be 4x smaller in
terms of delay, and that ends up acting as a boundary on how bad
things can get under load. And I've taken steps to reduce other
potential problems by installing multiple access points in my
home, for example.
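The reason a faster tier helps is the same size-over-rate arithmetic: the buffer holds a fixed number of bytes, so a faster link drains it proportionally sooner. With the same assumed 256 KB buffer:

    # Worst-case queueing delay for two hypothetical service tiers
    buffer_bytes = 256 * 1024
    for mbps in (2, 8):                         # base tier vs. a 4x tier
        delay_s = buffer_bytes / (mbps * 1_000_000 / 8)
        print(f"{mbps} Mbit/s uplink -> {delay_s:.2f} s worst-case added delay")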
JG The problem is that the next generation of equipment will come
out with even larger buffers. That's part of why I was having
trouble initially reproducing this problem with DOCSIS (Data over
Cable Service Interface Specification) 3.0 modems. That is,
because I had even more extreme buffering than I'd had before, it
took even longer to fill up the buffer and get it to start
misbehaving.
VC What I think you've just outlined is a measure of goodness
that later proved to be exactly the wrong thing to do. At first,
the equipment manufacturers believed that adding more buffers
would be a good thing, primarily to handle increased traffic
volumes and provide for fair access to capacity. Of course, it
has also become increasingly difficult to buy a chip that doesn't
have a lot of memory in it.
NW Also, to the degree that people have been testing at all,
they've been testing for latency or bandwidth. The problem we're
discussing is one of latency under load, so if you test only
quiescent latency, you won't notice it; and if you test only
bandwidth, you'll never notice it. Unless you're testing
specifically for behavior under load, you won't even be aware
this is happening.
VAN JACOBSON I think there's a deeper problem. We know the cause
of these big queues is data piling up wherever there's a
fast-to-slow transition in the network. That generally happens
either going from the Internet core out to a subscriber (as with
YouTube videos) or from the subscriber back into the core, where
a fast home network such as a 54-megabit wireless hits a slow 1-
to 2-megabit Internet connection.
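A back-of-the-envelope sketch of how quickly such a queue builds, using assumed figures (1 MB of buffering at the bottleneck):

    # Fill and drain times at a fast-to-slow transition (assumed figures)
    lan_bytes_per_s = 54_000_000 / 8   # 54 Mbit/s wireless feeding the modem
    wan_bytes_per_s = 2_000_000 / 8    # 2 Mbit/s uplink draining it
    buffer_bytes = 1024 * 1024         # hypothetical 1 MB bottleneck buffer
    fill_s = buffer_bytes / (lan_bytes_per_s - wan_bytes_per_s)
    drain_s = buffer_bytes / wan_bytes_per_s
    print(f"fills in ~{fill_s:.2f} s, then adds ~{drain_s:.1f} s of standing delay")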
[snip]