Parsing CloudFront logs using Hadoop and Pig on Amazon’s Elastic MapReduce

I spent my Saturday throwing together a Pig script which will parse the Cloudfront Logs and spit out the bandwidth consumed per object.

It took me a while to get the details and synax right, so I figured I’d share it here.

A few things missing:

1. Right now it parses everything – should probably be incremental, only parsing the things since the last run. I was thinking of renaming the directory that the logs go into right before processing, and then process on the renamed directory. That way, the original log directly will be recreated by the cloudfront logging process, and the new log files should go into that directory.

2. Dealing with streaming logs. I do have the script written for this, but haven’t tested it.

3. Actually setting this up in a scenario where it’ll get run automatically every day or every few hours and the resulting data stored somewhere in the database.

To run: Input is the directory containing the CloudFront log files. The script will read all of them and knows how to gunzip them, too. Output is the directory you want the output to be written to.