Desultory Monday...


This entry was posted using It’s All Text! on Firefox 3.0 RC2 on Ubuntu Hardy Heron, with an Emacs 23 snapshot as the editor. I love it :-)

Well, It’s All Text! is great if you hate typing into web forms whose text boxes make editing such a big pain in the butt.

It’s great to see that It’s All Text! has been updated to work with FF 3.0. The fun part will be seeing whether this works on Windows with Cygwin Emacs as the editor. I had problems the last time I tried that - but that was some time ago now.

Today’s been a desultory Monday. Spent some time getting the Emacs snapshot running with pretty fonts on my Hardy install. It’s beautiful.

The rest of the day has mostly gone into scratching my head over Hadoop. What I’d like to do is parse an access log and generate multiple outputs - i.e. a single input of gobs of web access logs and several outputs, say requests by country, popular pages, % share of each client browser and so on.

  1. Parse the web log.

  2. Pull out the remote IPs and use GeoIP to find the originating country.

  3. Pull out the user agent field and figure out the browser distribution.

  4. Filter the requested resources down to pages only, then rank the pages by popularity.
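
A rough sketch of steps 1, 3 and 4 in plain Java, before any Hadoop gets involved. Everything here is an assumption on my part: the logs are taken to be in Apache "combined" format, `AccessLogLine` is a made-up helper name, and the browser and page checks are crude substring and extension tests rather than anything robust. The GeoIP lookup in step 2 is left out because it needs an external database.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: the fields of one access log line needed for the reports. */
final class AccessLogLine {
    // Assumed format (Apache "combined"):
    // ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    final String remoteIp;
    final String path;
    final int status;
    final String userAgent;

    private AccessLogLine(String remoteIp, String path, int status, String userAgent) {
        this.remoteIp = remoteIp;
        this.path = path;
        this.status = status;
        this.userAgent = userAgent;
    }

    /** Step 1: parse one line; returns null when the line does not match the assumed format. */
    static AccessLogLine parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new AccessLogLine(m.group(1), m.group(4), Integer.parseInt(m.group(5)), m.group(8));
    }

    /** Step 3: crude browser family guess based on substrings of the user agent. */
    String browserFamily() {
        if (userAgent.contains("Opera")) return "Opera";
        if (userAgent.contains("Firefox")) return "Firefox";
        if (userAgent.contains("MSIE")) return "MSIE";
        if (userAgent.contains("Chrome")) return "Chrome";
        if (userAgent.contains("Safari")) return "Safari";
        return "Other";
    }

    /** Step 4: treat extension-less paths and .html/.htm/.php as pages; everything else is an asset. */
    boolean isPage() {
        int query = path.indexOf('?');
        String p = query >= 0 ? path.substring(0, query) : path;
        int dot = p.lastIndexOf('.');
        if (dot < p.lastIndexOf('/')) {
            return true; // no extension in the last path segment
        }
        String ext = p.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ext.equals("html") || ext.equals("htm") || ext.equals("php");
    }
}
```

Returning null for lines that do not match keeps the caller simple: real logs always have some junk in them, and a mapper can just skip those lines.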

Now there seem to be quite a number of ways of doing this -

  • Code the whole thing in Java - and this is where I’m getting into analysis paralysis. I’d have to look at ways to generate multiple outputs from MapReduce and then use Job and JobControl to set up the pipeline.

  • Use Pig - the examples on the Pig overview page seem to suggest that this should be trivial with Pig.

  • Use Cascading - it seems to do the same thing, though I’d need to do this in JRuby or Groovy.
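
For the Java route, one idea I keep coming back to for getting several reports out of a single pass is to tag every emitted key with the report it belongs to, and split on the tag when summing. The sketch below is plain Java with no Hadoop classes at all, just to show the shape of it; `TaggedCounts` and its `map`/`reduce` methods are names I made up, and the tab-separated tag is an assumption, not anything Hadoop prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a mapper/reducer pair that feeds several reports at once. */
final class TaggedCounts {

    /** "Map" side: emit one tagged key per report for a single parsed request. */
    static List<String> map(String country, String browser, String page) {
        List<String> keys = new ArrayList<>();
        keys.add("country\t" + country);
        keys.add("browser\t" + browser);
        if (page != null) {
            // page is null when the request was for an asset rather than a page
            keys.add("page\t" + page);
        }
        return keys;
    }

    /** "Reduce" side: count occurrences per tagged key, grouped into one map per report tag. */
    static Map<String, Map<String, Integer>> reduce(List<String> emitted) {
        Map<String, Map<String, Integer>> reports = new TreeMap<>();
        for (String key : emitted) {
            int tab = key.indexOf('\t');
            String tag = key.substring(0, tab);
            String value = key.substring(tab + 1);
            reports.computeIfAbsent(tag, t -> new TreeMap<>())
                   .merge(value, 1, Integer::sum);
        }
        return reports;
    }
}
```

Feeding it three requests (two from IN, one from US, two of them for /index.html) gives a "country" map, a "browser" map and a "page" map from the same list of emitted keys. Whether this beats running separate jobs per report is exactly the part I haven’t worked out yet.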

Will post an update once I get through the Java route.