Last Updated: 2012-08-16 23:31:11 UTC
by Johannes Ullrich (Version: 1)
I still think DNS logs are one of the most overlooked resources for intrusion and malware detection. Command and control servers frequently use specific top level domains or host names, and because of short TTL values, infected hosts will query DNS servers for these names often.
Additionally, DNS servers are overlooked choke points. They are as valuable for collecting network-wide data as the firewalls and routers that connect the network to the internet.
In this diary, I would like to introduce a simple shell script that answers one question which, in my opinion, is quite useful for detecting anomalous DNS queries: Which are the top 10 new host names that we looked up today?
First, you need DNS query logs. There are two ways to collect them: you could either enable query logging in your DNS server, or you could just use tcpdump on the DNS server to capture the queries. Query logging works fine for me, but it can put too much strain on a very busy name server. Running tcpdump on the name server, or on a sensor monitoring the name server, may work better. We do not have to capture every single query for this technique to work.
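If you go the tcpdump route, the host names have to be pulled out of tcpdump's text output rather than out of query log lines. The following is only a rough sketch: the function name extract_names, the interface name eth0 and the sample capture lines are made up for illustration, and the line layout it assumes may differ between tcpdump versions.

```shell
#!/bin/sh
# Sketch: extract query names from tcpdump text output.
# Assumed line layout (may vary by tcpdump version):
#   21:42:00.260000 IP 10.5.0.210.54481 > 192.0.2.1.53: 4660+ A? a1406.g.akamai.net. (36)
extract_names() {
    # Keep only lines that contain "<TYPE>? <name>. " and print the name
    # without its trailing dot.
    sed -n -E 's/.* [A-Z0-9]+\? ([^ ]+)\. .*/\1/p'
}

# Live use (needs root; eth0 is a placeholder interface name):
#   tcpdump -l -n -i eth0 udp dst port 53 | extract_names

# Offline demonstration with two made-up capture lines:
names=$(printf '%s\n' \
    '21:42:00.260000 IP 10.5.0.210.54481 > 192.0.2.1.53: 4660+ A? a1406.g.akamai.net. (36)' \
    '21:42:01.100000 IP 10.5.0.211.40000 > 192.0.2.1.53: 4661+ AAAA? www.example.com. (33)' \
    | extract_names)
printf '%s\n' "$names"
```

The resulting list of names can be fed into the same "sort | uniq -c | sort -k2" steps used for the query logs.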
Next, we need to summarize past queries. In my case, the query logs are rotated hourly, and saved in files with names like "query.log.*" (* is a number). A sample line from my query logs:
16-Aug-2012 21:42:00.260 queries: info: client 10.5.0.210#54481: query: a1406.g.akamai.net IN A + (192.0.2.1)
To extract the host names, and summarize them, I use the following script:
cat query.log.*| sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > oldlog
This will sort the output by hostname (sort -k2 sorts by the second column), which becomes important later.
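To see what the summarizing step produces, here is a small sketch that runs the same pipeline over three made-up log lines in the format shown above. The function name summarize and the sample lines are illustrative only. Setting LC_ALL=C is my addition: it forces plain byte-order sorting, so that sort and the later join agree on the ordering.

```shell
#!/bin/sh
# Sketch: the summarizing pipeline applied to a few made-up query log lines.
export LC_ALL=C    # assumption: byte-order collation for sort (and later join)

summarize() {
    # Strip everything up to "query: ", keep the host name, count duplicates,
    # then sort by host name (second column of the "uniq -c" output).
    sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2
}

summary=$(printf '%s\n' \
    '16-Aug-2012 21:42:00.260 queries: info: client 10.5.0.210#54481: query: a1406.g.akamai.net IN A + (192.0.2.1)' \
    '16-Aug-2012 21:42:01.100 queries: info: client 10.5.0.211#40000: query: www.example.com IN A + (192.0.2.1)' \
    '16-Aug-2012 21:42:02.300 queries: info: client 10.5.0.210#54482: query: a1406.g.akamai.net IN AAAA + (192.0.2.1)' \
    | summarize)
printf '%s\n' "$summary"
```

Each output line holds a count followed by a host name, with a1406.g.akamai.net counted twice and listed before www.example.com. The amount of leading whitespace before the count depends on your version of uniq.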
Next, I apply the same procedure to the current log:
cat query.log| sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > newlog
Now, we need to find all entries in "newlog" that are not included in "oldlog". To do so, we use the Unix command "join" (a standard utility, not a bash builtin), which works pretty much like the SQL command join, but uses the two text files as input. It is important that both files are sorted on the "join" column (the host name), which was the reason for the -k2 argument earlier.
join -1 2 -2 2 -a 2 oldlog newlog > combined
-a 2 will include all records from newlog that are not found in oldlog. "combined" now includes lines from both files, as well as the lines only found in "newlog". We need to remove the lines found in both files (which are identified by having two numbers):
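A tiny worked example may make the shape of "combined" easier to picture. This is only a sketch: the host names and counts are made up, the scratch directory comes from mktemp, and LC_ALL=C is my addition so that sort and join use the same ordering.

```shell
#!/bin/sh
# Sketch: what "combined" looks like for two tiny, made-up summaries.
export LC_ALL=C          # assumption: same byte-order collation for sort and join
work=$(mktemp -d)        # scratch directory for this demonstration only

# Count-then-host lines, already sorted by host name (as "sort -k2" leaves them).
printf '%s\n' '   3 a1406.g.akamai.net' '  12 www.example.com' > "$work/oldlog"
printf '%s\n' '   5 a1406.g.akamai.net' '   7 new-host.example.net' \
              '   1 www.example.com' > "$work/newlog"

# Join on column 2 of both files; -a 2 also prints newlog lines without a match.
combined=$(join -1 2 -2 2 -a 2 "$work/oldlog" "$work/newlog")
printf '%s\n' "$combined"

rm -r "$work"
```

Hosts found in both files come out as "host oldcount newcount", while new-host.example.net, which exists only in newlog, comes out with a single count. That difference is what the filter in the next step relies on.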
cat combined | egrep -v '.* [0-9]+ [0-9]+$' | sort -nr -k2 | head -10
In the end, we sort the host names by frequency, and return the top 10.
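As a possible shortcut, join also has a -v option, which prints only the unpairable lines of the given file. With -v 2, the separate filtering step should not be needed. The sketch below compares both variants on made-up data. It uses "grep -E" in place of egrep, and LC_ALL=C is again my assumption.

```shell
#!/bin/sh
# Sketch: "join -a 2" plus a filter versus "join -v 2" on made-up summaries.
export LC_ALL=C          # assumption: same byte-order collation for sort and join
work=$(mktemp -d)        # scratch directory for this demonstration only

printf '%s\n' '   3 a1406.g.akamai.net' '  12 www.example.com' > "$work/oldlog"
printf '%s\n' '   5 a1406.g.akamai.net' '   7 new-host.example.net' \
              '   2 rare.example.org' '   1 www.example.com' > "$work/newlog"

# Variant from the text: keep unmatched newlog lines, then drop lines with two counts.
via_filter=$(join -1 2 -2 2 -a 2 "$work/oldlog" "$work/newlog" \
    | grep -E -v '.* [0-9]+ [0-9]+$' | sort -nr -k2 | head -10)

# Variant with -v 2: join prints only the newlog lines that have no match in oldlog.
via_v=$(join -1 2 -2 2 -v 2 "$work/oldlog" "$work/newlog" | sort -nr -k2 | head -10)

printf '%s\n' "$via_v"
rm -r "$work"
```

Both variants list the same new host names, most frequently queried first.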
To summarize the script for simple "copy/paste". I broke some lines up to a
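oldlogs='query.log.*'   # rotated logs; left unquoted below so the shell expands the wildcard
newlog=query.log        # current log
tmpdir=/tmp             # example value; any writable directory works
cat $oldlogs | sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > $tmpdir/oldlog
cat $newlog | sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > $tmpdir/newlog
join -1 2 -2 2 -a 2 $tmpdir/oldlog $tmpdir/newlog | egrep -v '.* [0-9]+ [0-9]+$' | sort -nr -k2 | head -10 > $tmpdir/suspects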
The file "suspects" will now include the top 10 suspect domains. For added credit: add the ability to keep a whitelist.
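One way the whitelist could work is sketched below. Everything in it is illustrative: the file names "candidates" and "whitelist", the host names and the counts are made up, and the match is on exact host names only (no wildcards or parent domains). The whitelist is applied before "head -10", so whitelisted names do not use up slots in the top 10.

```shell
#!/bin/sh
# Sketch: drop whitelisted host names from a "host count" list.
work=$(mktemp -d)        # scratch directory for this demonstration only

# Made-up candidate list (new hosts, already sorted by frequency) and whitelist.
printf '%s\n' 'new-host.example.net 7' 'updates.example.com 4' \
              'rare.example.org 2' > "$work/candidates"
printf '%s\n' 'updates.example.com' > "$work/whitelist"

# While reading the first file, remember each whitelisted host name.
# While reading the second file, print lines whose host (column 1) was not remembered.
suspects=$(awk 'FILENAME == ARGV[1] { ok[$1] = 1; next } !($1 in ok)' \
    "$work/whitelist" "$work/candidates" | head -10)
printf '%s\n' "$suspects"

rm -r "$work"
```

Comparing FILENAME against ARGV[1], rather than using the common NR == FNR idiom, keeps the filter working when the whitelist file is empty.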