Last Updated: 2017-04-09 20:27:41 UTC
by Didier Stevens (Version: 1)
I was asked if I could share the files from my last diary entry: Domain Whitelisting With Alexa and Umbrella Lists.
You can find the files on my site here. And to teach you how to fish :-), here are the commands I used to produce these lists:
csv-cut.py -s "\t" 1 emd.txt > blocklist.txt
csv-lookup.py -s , -e blocklist.txt 0 top-1m-umbrella.csv 1 0 blocklist-umbrella.csv
csv-lookup.py -s , -e blocklist.txt 0 top-1m-alexa.csv 1 0 blocklist-alexa.csv
My csv tools can be found on my Beta GitHub repository.
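For readers who do not have the csv tools at hand, here is a minimal Python sketch of the lookup idea: keep only the rows of a ranking list whose domain also appears in the blocklist. This is not the code of csv-lookup.py. It assumes the ranking file has two comma-separated columns (rank, domain) and the blocklist has one domain per line. The function name ranked_blocklist and the example domains are made up for illustration.

```python
import csv
import io


def ranked_blocklist(blocklist_lines, ranking_rows):
    """Return (rank, domain) pairs for ranked domains that are also blocklisted.

    Assumption: blocklist_lines yields one domain per line and
    ranking_rows yields [rank, domain] rows, as in a "rank,domain" CSV file.
    """
    blocked = {line.strip().lower() for line in blocklist_lines if line.strip()}
    matches = []
    for row in ranking_rows:
        if len(row) < 2:
            continue  # skip empty or malformed rows
        rank, domain = row[0], row[1]
        if domain.strip().lower() in blocked:
            matches.append((int(rank), domain.strip()))
    return matches


# Tiny in-memory demo with hypothetical data instead of the real files.
blocklist = io.StringIO("bad.example\nworse.example\n")
ranking = io.StringIO(
    "1,good.example\n2,bad.example\n3,other.example\n4,worse.example\n"
)
matches = ranked_blocklist(blocklist, csv.reader(ranking))
print(matches)
```

With real data, you would pass open file objects (for example blocklist.txt and top-1m-alexa.csv) instead of the in-memory strings.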
When I read this blog post, my assumption was that the blocklisted domains would rank low in the Alexa and Umbrella lists. They don't: look at the histograms of the rankings.
Blocklisted domains with Alexa rank:
Blocklisted domains with Umbrella rank:
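These long-tail distributions indicate that blocklisted domains with higher ranks (closer to the top of the list) are more prevalent than those with lower ranks. This is also reflected in the median rank (287,251 for Alexa and 393,879 for Umbrella) and the average rank (350,553 for Alexa and 420,846 for Umbrella): all four values lie below 500,000, the midpoint you would expect if the blocklisted domains were spread evenly over the 1,000,000 entries.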
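If you want to reproduce this kind of summary yourself, here is a rough Python sketch that computes the median, the average and a simple histogram from a list of ranks. It is not how the figures above were produced. The function name summarize_ranks, the bin count of 10 and the sample ranks are assumptions chosen for illustration.

```python
import statistics


def summarize_ranks(ranks, list_size=1_000_000, bins=10):
    """Return the median, mean and an equal-width histogram of the ranks.

    Assumption: ranks are integers from 1 to list_size (1 = most popular).
    """
    width = list_size // bins
    histogram = [0] * bins
    for rank in ranks:
        # Rank 1..width goes to bin 0, width+1..2*width to bin 1, and so on.
        index = min((rank - 1) // width, bins - 1)
        histogram[index] += 1
    return {
        "median": statistics.median(ranks),
        "mean": statistics.mean(ranks),
        "histogram": histogram,
    }


# Made-up ranks, not taken from the real lists.
sample_ranks = [120, 4500, 98000, 250000, 610000]
summary = summarize_ranks(sample_ranks)
print(summary)
```

A histogram with most of its counts in the first bins, together with a median below half the list size, is the pattern described above.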
Conclusion: don't blindly use the Alexa and Umbrella top 1,000,000 lists as whitelists, even if you only use the top 1,000 or 10,000.