Background

We’ve known for a while that some proxies, anti-virus programs, and other client-side applications that do content analysis munge HTTP request headers to disable compression. They do this because it’s easier to analyze uncompressed content. For web service providers, though, this means serving uncompressed content to users, which results in higher bandwidth bills.

This was first discovered by Tony Gentilcore and Andy Martone at Google, who presented their findings and a workaround at Velocity 2009 and 2010. Marcel’s post on the YDN blog has the details and an alternate solution, so I won’t go into them here; in short, they found that 15% of requests did not include the Accept-Encoding: gzip header.

The state today

A few days ago, Avi Keinan posted some stats to the Webpagetest forums about the state of compression headers today. He found that the number was closer to 1%, but his sample size was fairly small, so I decided to rerun the analysis with a subset of our data from August 15. This is what we found:

  • Sample size: 17,692,213
  • Requests without a suitable Accept-Encoding header: 190,364 (1.07%)
  • Unique requests without a suitable Accept-Encoding header: 11,884
  • Unique headers that may have once been an Accept-Encoding header: 29

So there were 29 headers that might have been an accept-encoding header. I arrived at these by a simple process of elimination: I removed all the headers I knew about, and then all the headers that made sense in some context or another. The ones that were left fell into a few patterns:

Header name | Header value | What it might have been
------- (7 dashes) | ----:-{18-251} (4 dashes, colon, 18 to 251 dashes) | ?
x-xxxxx (x, dash, 5 'x'es) | x{22-210} (22 to 210 'x'es) | ?
xxxxxxx (7 'x'es) | X{30-117} (30 to 117 'X'es) | ?
---------- (10 dashes) | ---------- (10 dashes) | ?
xxxxxxxxxx (10 'x'es) | XXXXXXXXXX (10 'X'es) | ?
---------------- (16 dashes) | -{10} (10 dashes) | Proxy-Connection: Keep-Alive
x-xxxxxxxxxxxxxxxxx (x, dash, 17 'x'es) | x{26,30} (26 or 30 'x'es) | ?
--------------- (15 dashes) | -{4,12,17} (4, 12 or 17 dashes) | Accept-Encoding: gzip, deflate, sdch
aaaaaaaaaaaaaaa (15 'a's) | +{14,18} (14 or 18 pluses) | Accept-Encoding: ?
xxxxxxxxxxxxxxx (15 'x'es) | [+X]{13,14} (13 or 14 pluses or 'X'es) | Accept-Encoding: ?
accelate, accepate, acceptte, accflate, aceflate, adeflate | (empty) | Accept-Encoding: deflate
accept-en, accept-encodind, accept-encodxng, accept-xncoding, x-cept-encoding, xccept-encoding | gzip, deflate, sdch | Accept-Encoding: gzip, deflate, sdch
te | chunked | Transfer-Encoding
x-cnection | close | Connection: close
xroxy-connection | Keep-Alive | Proxy-Connection: Keep-Alive

Not all of these map to the Accept-Encoding header, but a large number of them do.
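As a rough sketch of how such candidates can be flagged automatically, the snippet below compares each unknown header name against the canonical accept-encoding using Python's standard-library difflib. This is an illustration only, not the exact method used for this analysis; the 0.75 similarity threshold is an assumption.

```python
import difflib

CANONICAL = "accept-encoding"

def looks_like_accept_encoding(name, threshold=0.75):
    """Heuristic check: does an unknown header name look like a
    mangled Accept-Encoding? The 0.75 threshold is an assumption."""
    ratio = difflib.SequenceMatcher(None, name.lower(), CANONICAL).ratio()
    return ratio >= threshold

# A few of the names from the table above, plus a legitimate header:
for name in ("xccept-encoding", "accept-encodind", "x-cept-encoding", "user-agent"):
    print(name, looks_like_accept_encoding(name))
```

A similarity check like this catches the misspelled variants (xccept-encoding, accept-encodind), but not the fully blanked-out runs of dashes and 'x'es, which need the length-and-pattern matching shown in the table.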

Who’s doing this?

I looked to see if there was anything in the user agents or other headers that would cluster these requests together, but no major patterns emerged.

Browser/OS

We see requests from the following Browser/OS combinations (in decreasing order of popularity):

  • IE/Windows
  • Chrome/Windows
  • Firefox/Windows
  • Opera/Windows
  • Safari/Windows
  • Safari/Mac OS X
  • Safari/iOS 5.1
  • Lunascape/Windows

The common theme here is Windows. 99.668% of all requests with mangled accept-encoding headers were from a browser running on Windows.

The iOS requests all came through a proxy that identified itself as localhost.localdomain, so my best guess is that this was someone proxying their iPhone’s traffic through their desktop.

The Mac OS X requests also appeared to come in through a proxy, one that sets an _sm_au cookie.

Proxies

Which leads us to Proxies.

I see many requests with a cookie named _sm_au and the value aaaaaaaaaaaaaaaaaaaa. This is odd because we don’t set any cookies, so we should never receive one from a client. I was unable to find any information on this cookie, so any ideas are appreciated. It could possibly be SMProxy.

A small number of requests also included an MSISDN, suggesting that they came from mobile devices; however, they all ran Firefox 14 on Windows.

Browser Plugins

Unfortunately we don’t currently have information on any browser plugins that may have caused these changes, but I’ll be looking for that going forward.

Summary

The number of requests coming in with mangled accept-encoding headers has definitely gone down, from 15% to about 1%… or it could simply be that we’re not Google. Some of these headers appear to be mangled by proxies, but not all of them.

1% of requests is still a large number when you’re handling over a billion hits a month, so it may still be worth ignoring these headers and gzipping content anyway.
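One way to act on that (a sketch, not the exact workaround presented at Velocity): treat a request as gzip-capable if it either sends a sane Accept-Encoding or sends one of the mangled header names from the table above. The set of mangled names below is illustrative, drawn from this analysis.

```python
# Names from the table above that clearly map to Accept-Encoding.
# This list is illustrative; a production list would need maintenance.
MANGLED_NAMES = {
    "accept-encodind", "accept-encodxng", "accept-xncoding",
    "x-cept-encoding", "xccept-encoding",
}

def client_accepts_gzip(headers):
    """Decide whether to gzip the response, given a dict of
    lower-cased request header names to values."""
    if "gzip" in headers.get("accept-encoding", "").lower():
        return True
    # The header was likely mangled by a proxy or AV product, but the
    # browser behind it almost certainly understands gzip anyway.
    return any(name in headers for name in MANGLED_NAMES)
```

The trade-off: a client that genuinely cannot decompress gzip would break, which is why the fully blanked-out headers (runs of dashes and 'x'es) are riskier to override than the obvious misspellings.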