
Parsing CloudFront logs using Hadoop and Pig on Amazon's Elastic MapReduce

I spent my Saturday throwing together a Pig script that parses the CloudFront logs and spits out the bandwidth consumed per object.

It took me a while to get the details and syntax right, so I figured I'd share it here.
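
The core of the approach looks something like this. This is a rough sketch rather than the script verbatim: it assumes the CloudFront download-distribution log format (tab-separated fields, with #Version and #Fields header lines) and the field order documented for it, and $INPUT/$OUTPUT are placeholder parameters filled in at run time.

-- Sum the bytes served per object from CloudFront access logs.
-- Field positions follow the download-distribution log format; adjust if
-- your logs differ. PigStorage reads the .gz log files transparently.
logs = LOAD '$INPUT' USING PigStorage('\t')
       AS (log_date:chararray, log_time:chararray, edge_location:chararray,
           sc_bytes:long, c_ip:chararray, cs_method:chararray,
           cs_host:chararray, cs_uri_stem:chararray, sc_status:chararray,
           referrer:chararray, user_agent:chararray, uri_query:chararray);

-- Drop the '#Version' and '#Fields' header lines.
records = FILTER logs BY log_date IS NOT NULL AND NOT (log_date MATCHES '#.*');

-- Group by object and total up the bytes served.
by_object = GROUP records BY cs_uri_stem;
bandwidth = FOREACH by_object
            GENERATE group AS uri_stem, SUM(records.sc_bytes) AS total_bytes;

STORE bandwidth INTO '$OUTPUT';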



A few things missing:

1. Right now it parses everything; it should probably be incremental, only parsing the logs that have arrived since the last run. I was thinking of renaming the directory that the logs go into right before processing, and then processing the renamed directory. That way, the original log directory will be recreated by the CloudFront logging process, and the new log files will go into it (a rough sketch of this is below, after the list).

2. Dealing with streaming logs. I do have the script written for this, but haven't tested it.

3. Actually setting this up so that it runs automatically every day or every few hours, with the resulting data stored in a database.
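
For the first item, the rename idea might look roughly like this, assuming the logs land in an S3 bucket and s3cmd is available (bucket and prefix names are placeholders, and the exact s3cmd flags may vary by version; S3 has no true rename, so it's a recursive move):

# Move the accumulated logs aside; CloudFront keeps writing new log files
# to the original prefix, which gets recreated as logging continues.
RUN_ID=$(date +%Y%m%d%H%M)
s3cmd mv --recursive s3://my-bucket/cloudfront-logs/ s3://my-bucket/cloudfront-logs-$RUN_ID/

The Pig job then runs against only the moved batch of logs.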

To run: the input is the directory containing the CloudFront log files; the script will read all of them and knows how to gunzip them, too. The output is the directory you want the results written to.
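
Assuming the script takes $INPUT and $OUTPUT parameters as in the sketch above, invoking it from the EMR master node might look like the following (the paths and the cloudfront_bandwidth.pig filename are placeholders):

pig -p INPUT=s3n://my-bucket/cloudfront-logs \
    -p OUTPUT=s3n://my-bucket/bandwidth-report \
    cloudfront_bandwidth.pig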

Enjoy!

2 comments

Brad Brown
 

Did you get these scripts working for Streaming content? This one almost works...do you have an update for it?
Calvin
 

Haven't done anything with them since. It's not something that I'm using in production. What are you looking to do?
