Visualizing and Cleaning Traffic Logs - Hands-On Guide

I have spent quite a bit of time with the VAST 2013 Mini Challenge 1. The given network traffic log is interesting, but poses some challenges. One of them is the ominous source/destination confusion: the network flow collector didn't correctly record the client side of the connection as the source, but recorded it as the destination. That creates all kinds of problems in data analysis, so you have to fix it first.

I wrote a blog entry on Cleaning Up Network Traffic Logs where I go step by step through the network logs to determine which records need to be turned around. I use both SQL and some parallel coordinate visualizations to get the job done. The final outcome is this one-liner Perl hack that actually fixes the data:

$ cat nf*.csv | perl -F, -ane 'BEGIN {@ports=(20,21,25,53,80,123,137,138,389,1900,1984,3389,5355);
%hash = map { $_ => 1 } @ports; $c=0} if ($hash{$F[7]} && $F[8]>1024)
{$c++; printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s",
$F[0],$F[1],$F[2],$F[3],$F[4],$F[6],$F[5],$F[8],$F[7],$F[9],$F[10],$F[11],$F[13],$F[12],
$F[15],$F[14],$F[17],$F[16],$F[18]} else {print $_} END {print STDERR "count of reversed records: $c\n";}'
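
The hack applies a simple heuristic: if a well-known service port shows up on the source side while the destination port is ephemeral (above 1024), the record is considered reversed, and the source and destination columns get swapped (IPs in fields 5 and 6, ports in 7 and 8, and the payload, byte, and packet counters further out). If you have the flows loaded into a database, you can sanity-check the same heuristic in SQL before rewriting anything. Here is a minimal sketch of that check, assuming a table called netflow with the VAST 2013 column names firstSeenSrcPort and firstSeenDestPort (adjust to your schema):

-- count flow records that look reversed: a well-known service port on the
-- source side paired with an ephemeral (>1024) port on the destination side
SELECT firstSeenSrcPort, COUNT(*) AS reversed_candidates
FROM netflow
WHERE firstSeenSrcPort IN (20,21,25,53,80,123,137,138,389,1900,1984,3389,5355)
  AND firstSeenDestPort > 1024
GROUP BY firstSeenSrcPort
ORDER BY reversed_candidates DESC;

If the counts per service port look plausible, you can run the Perl hack with confidence; the count it prints to stderr at the end should match the total from this query.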

Read the full article here: Cleaning Up Network Traffic Logs

If you want to know how to set up a columnar data store to query the network flows, I also wrote a quick step-by-step guide on loading the network traffic logs into Impala with Parquet as the storage format.
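
As a rough sketch of what that setup looks like: land the CSV files in HDFS, define a text-format staging table over them, then rewrite into a Parquet table. The column names below follow the VAST 2013 netflow header as I remember it, and the HDFS path is hypothetical; the guide itself walks through the exact steps.

-- staging table over the raw CSV files already copied into HDFS
CREATE EXTERNAL TABLE nf_raw (
  timeSeconds DOUBLE,
  parsedDate STRING,
  dateTimeStr STRING,
  ipLayerProtocol INT,
  ipLayerProtocolCode STRING,
  firstSeenSrcIp STRING,
  firstSeenDestIp STRING,
  firstSeenSrcPort INT,
  firstSeenDestPort INT,
  moreFragments INT,
  contFragments INT,
  durationSeconds INT,
  firstSeenSrcPayloadBytes BIGINT,
  firstSeenDestPayloadBytes BIGINT,
  firstSeenSrcTotalBytes BIGINT,
  firstSeenDestTotalBytes BIGINT,
  firstSeenSrcPacketCount BIGINT,
  firstSeenDestPacketCount BIGINT,
  recordForceOut INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/vast/nf_raw';

-- rewrite into Parquet for fast columnar scans
CREATE TABLE netflow STORED AS PARQUET AS SELECT * FROM nf_raw;

Once the Parquet table exists, all the analysis queries run against it rather than the raw text files, which makes the aggregations over tens of millions of flow records quite a bit faster.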