Thursday, April 12, 2012

Raw data is good, Normalized is better, Categorized is awesome!

Continuing my last post on why using normalized data is better than just using raw data and how it accelerates the analysis process, resulting in a faster response and therefore money saved, I'd like to focus now on the data mining aspect.

Remember the scenario: you are the IT person responsible for your company's custom-developed transaction application, and your boss asks you to send him a report with all the activity related to account number 1234567890.

Of course you can give all the raw information to your boss, but I'm not sure he is going to like the idea of receiving a 20-page report with every entry where that account has been involved...

Having the data in raw format is good, we need it, but data mining on it is very difficult.

I'd prefer to give him something easier to handle, maybe an Excel file where the information is easy to visualize, filter, group, etc.... Maybe create some graphs...


Imagine that your application logs look something like this:

Mar 09 04:28:58 192.168.1.101 tx=897218 user=aalonso src=71.142.234.66 geo=SP num=1234567890 trans=Purchase status=0 amt=2976.52 sha1hex=13b27510900ee31f64ecd1f80ff52b4f5fa6fcfd

Why not try to understand all this raw data from the transaction application and normalize its content as field:value pairs? It's a one-time effort that will give me a lot of value afterwards.

Your tool may allow you to parse on the fly by applying regular expressions to the raw event; doing this, I'm able to structure the information into different fields like user name, account number, transaction type, amount, transaction ID, source, etc...
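
Just as a sketch of what that parsing could look like (the regular expression and field names here are my own assumptions, not the output of any particular tool), a few lines of Python can extract the fields from the sample event above:

import re

# Hypothetical pattern for the sample event shown above; field names are assumptions.
LOG_PATTERN = re.compile(
    r'(?P<month>\w{3}) (?P<day>\d{2}) (?P<time>[\d:]+) (?P<host>\S+) '
    r'tx=(?P<tx>\d+) user=(?P<user>\S+) src=(?P<src>\S+) geo=(?P<geo>\S+) '
    r'num=(?P<account>\d+) trans=(?P<trans>\S+) status=(?P<status>\d+) '
    r'amt=(?P<amt>[\d.]+) sha1hex=(?P<sha1hex>\w+)'
)

raw = ("Mar 09 04:28:58 192.168.1.101 tx=897218 user=aalonso src=71.142.234.66 "
       "geo=SP num=1234567890 trans=Purchase status=0 amt=2976.52 "
       "sha1hex=13b27510900ee31f64ecd1f80ff52b4f5fa6fcfd")

def parse_event(line):
    """Turn one raw log line into a dict of field:value pairs, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

print(parse_event(raw)["account"])   # -> 1234567890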



But I'm pretty sure you don't want to do this parsing on the fly every time you need to do forensics. So why not just reuse the parser you have created and normalize all future incoming data as events are received and stored? That way this one-time effort can be used by you or other people in your organization.
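
A minimal sketch of that idea, reusing the parse_event() function from the previous example (imagine it saved in a hypothetical parser_sketch.py) and purely made-up file names, could store every normalized event as a row in a CSV file:

import csv
from parser_sketch import parse_event   # hypothetical module holding the previous sketch

FIELDS = ["month", "day", "time", "host", "tx", "user", "src", "geo",
          "account", "trans", "status", "amt", "sha1hex"]

# Hypothetical file names; in practice this would run wherever events are received.
with open("app.log") as logfile, open("normalized.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for line in logfile:
        event = parse_event(line)
        if event:                        # keep only lines the parser understands
            writer.writerow(event)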

Visualizing and mining this information in a structured way is much better than just looking at the raw data. Now you can order by transaction amount, create statistics of transaction types, source countries, top source-account pairs and anything else you can imagine. Any extra request your boss could have on this data will be easy to handle.
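
For example, once the events are normalized, a few lines are enough to order them by amount and build those statistics (again assuming the normalized.csv file from the previous sketch):

import csv
from collections import Counter

with open("normalized.csv") as f:
    events = list(csv.DictReader(f))

# Biggest transactions first.
by_amount = sorted(events, key=lambda e: float(e["amt"]), reverse=True)

# Simple statistics: transaction types, source countries, top source-account pairs.
trans_types = Counter(e["trans"] for e in events)
countries = Counter(e["geo"] for e in events)
src_account = Counter((e["src"], e["account"]) for e in events)

print(by_amount[0]["tx"], trans_types.most_common(3), src_account.most_common(5))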

Actually, imagine your boss asks you what that "status" field means. After talking to the application developers you find out it means failed or successful. So, again, why not add this knowledge you have already learned and categorize the events? When I say "categorize", I mean adding some extra meta-information to the event in order to facilitate its understanding and eventually accelerate the forensics and investigation response time or, in other words, save money.
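
As a sketch, that categorization can be as simple as enriching each normalized event with a human-readable label (which value of "status" means successful is an assumption here, just for illustration):

# Assumption for illustration: the developers say "status" encodes whether the
# transaction failed or succeeded; here I take 0 as successful.
STATUS_LABELS = {"0": "successful"}

def categorize(event):
    """Add extra meta-information to a normalized event."""
    event["status_label"] = STATUS_LABELS.get(event["status"], "failed")
    return event

print(categorize({"status": "0", "trans": "Purchase"}))
# {'status': '0', 'trans': 'Purchase', 'status_label': 'successful'}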


In some one-time scenarios raw data can be enough, but in most cases keeping the data normalized will give you faster and easier data mining. That's why most vendors spend such an effort creating parsers to normalize and categorize the events coming from several log sources, to make your life easier. So why not keep the best of both worlds?
