Processing unstructured data with Hadoop + MapReduce

Question

I would like to use hadoop to process unstructured CSV files. These files are unstructured in the sense that they contain multiple data values from different types with varying row lengths. In addition, there are hundreds of these files and they are often relatively large in size (> 200Mb).

The structure of each file can be demonstrated like so:

Book     , ISBN          , BookName                     , Authors    , Edition
Book     , 978-1934356081, Programming Ruby 1.9         , Dave Thomas, 1
Book     , 978-0596158101, Programming Python           , Mark Lutz  , 4
...
BookPrice, ISBN          , Store                        , Price
BookPrice, 978-1934356081, amazon.com                   , 30.0
BookPrice, 978-1934356081, barnesandnoble.com           , 30.67
BookPrice, 978-0596158101, amazon.com                   , 39.55
BookPrice, 978-0596158101, barnesandnoble.com           , 44.66
...
Book     , ISBN          , BookName                     , Authors    , Edition
Book     , 978-1449311520, Hadoop - The Definitive Guide, Tom White  , 3
...

The files are generated automatically, and I have no control over the given structure. Basically, there's a header row followed by data rows containing values matching the headers. The type of row can be identified by the first comma separated word. So, from the example, the Book row contains metadata about books (name, isbn, author, edition), and the BookPrice contains the various prices for the books for different outlets/vendors.

I'm trying to understand how to use Map/Reduce to perform some aggregate calculations on the data. Having the data structured the way it is makes it more difficult to understand what key -> value pairs to extract in each phase.

For example, I'd like to calculate the AVERAGE, MAX and MIN prices for each book (can be joined/grouped by ISBN). I realize I can do some pre-processing to extract that data to ordered, one-type CSV files and work from there (using grep, python, awk, etc), but that'll defeat the point of using M/R+Hadoop, and will require a lot of additional work.

I thought about using multiple map stages, but I'm fairly new to all of this and not sure how/where to start.

How do I go about implementing such M/R job (in Java) for the sample file/query? Thanks.

Processing unstructured data with Hadoop + MapReduce

Answers (1)

Related Questions