This week my company arranged a seminar on log management, and I had the opportunity to give a demo of one of our products.
My goal was to show why using normalized data is better than just using raw data, and how this accelerates the analysis process, resulting in faster responses and, therefore, cost savings.
When I talk about normalized data, I mean that the information contained in an event is split into "field:value" pairs. To put it another way, we understand the content of that event.
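To make this concrete, here is a minimal Python sketch of what normalization means; the log format and the field names are invented for illustration and are not taken from any real product:

```python
import re

# A made-up raw event from a hypothetical transaction application;
# the format and the field names are purely illustrative.
raw_event = '2012-03-08 17:42:13 branch=MADRID-02 op=TRANSFER account=1234567890 amount=1500.00 status=OK'

def normalize(event: str) -> dict:
    """Split one raw event into "field:value" pairs."""
    fields = {'timestamp': event[:19]}          # leading date and time
    for key, value in re.findall(r'(\w+)=(\S+)', event):
        fields[key] = value
    return fields

print(normalize(raw_event))
# {'timestamp': '2012-03-08 17:42:13', 'branch': 'MADRID-02', 'op': 'TRANSFER',
#  'account': '1234567890', 'amount': '1500.00', 'status': 'OK'}
```

The raw string and the dictionary carry exactly the same information; the difference is that, in the second form, the log management system knows which part of the event is the account number.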
Imagine a scenario where you are the IT person responsible for your company's custom-developed transaction application. This application manages all the financial transactions between the company's different locations and the central server. Aware as you are of the importance of keeping logs properly secured, you have the logs of this application sent to your company's log management system.
As you don't have any specific use for these logs beyond possible future troubleshooting, and there are no regulatory requirements that specify anything else, you decided that keeping the logs in raw format is enough. When I say "raw" format, I mean storing the logs just as the application creates them; in other words, without understanding their content. And this may look like a perfectly valid decision in some cases.
But imagine...
It's Thursday afternoon, almost time to go home, and suddenly you get a request from your boss: "I need a report of everyone who has accessed customer account number 1234567890, asap!". Let's go, analysis time:
First you need to locate all the transactions where the account number is involved. This should be fairly easy, because your tool lets you run a full-text search through all the logs.
The application generates around 70 GB of data per day; in other words, roughly 250 million events per day at about 300 bytes per event.
As your system supports raw full-text indexed search, looking for this account number in the last 24 hours only takes around 16 minutes (if your product doesn't support full-text indexed searches, go for dinner and come back tomorrow).
OK, this works. But if the account number were normalized into a specific field (ideally no longer than 32 bytes) and that field were indexed, the query time could be reduced by up to 40%; let's be pessimistic and assume the new query time is 10 minutes (roughly a 35% reduction).
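To see where that difference comes from, here is a toy Python sketch contrasting the two approaches. The in-memory structures only illustrate the principle; a real log management product keeps its indexes on disk:

```python
from collections import defaultdict

# A handful of already-normalized events; imagine 250 million of these per day.
events = [
    {'account': '1234567890', 'op': 'TRANSFER', 'raw': '... account=1234567890 ...'},
    {'account': '9876543210', 'op': 'PAYMENT',  'raw': '... account=9876543210 ...'},
]

# Raw approach: every query has to scan the full text of every event.
def full_text_search(account: str) -> list:
    return [e for e in events if account in e['raw']]

# Normalized approach: the account field is extracted once, at ingestion time,
# and kept in an index, so the query becomes a direct lookup.
index_by_account = defaultdict(list)
for e in events:
    index_by_account[e['account']].append(e)

def indexed_search(account: str) -> list:
    return index_by_account[account]

assert full_text_search('1234567890') == indexed_search('1234567890')
```

The work of understanding the event is paid once, when it is stored, instead of on every single query.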
A 6-minute reduction isn't very impressive when this is an isolated request, and therefore keeping the information in raw format still looks like a completely valid solution...
But if we think a bit bigger: what happens if you need to generate this report for the whole three months the information is kept online?
Raw data (full-text indexed) search: 16 min/day × 90 days = 24 hours
Normalized data (indexed field) search: 10 min/day × 90 days = 15 hours
A 9-hour difference can mean a lot now; in fact, it can be the difference between giving an answer on Friday and working on Saturday...
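For the curious, the back-of-the-envelope numbers above can be reproduced with a few lines of Python; they are the same rough lab estimates, nothing more:

```python
# Rough estimates from the post; not benchmark results.
events_per_day = 250_000_000
event_size_bytes = 300
print(events_per_day * event_size_bytes / 1e9)            # 75.0 GB/day ("around 70 GB")

raw_minutes_per_day = 16          # full-text search over one day of raw data
normalized_minutes_per_day = 10   # indexed-field search over the same day
days_online = 90

print(raw_minutes_per_day * days_online / 60)             # 24.0 hours
print(normalized_minutes_per_day * days_online / 60)      # 15.0 hours
print((raw_minutes_per_day - normalized_minutes_per_day) * days_online / 60)  # 9.0 hours saved
```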
In addition, think about what happens if this becomes a common request for more customers: how many account numbers are registered in your application? Twenty thousand, you said? Uff!
Note: all the calculations presented in this text are based on rough estimates from our lab environment and should in no case be taken as absolute numbers on which to base a decision.
Sunday, March 11, 2012