Reputation: 564
I want to use Apache Pig, but so far I have only parsed simply delimited data such as CSV.
But if my data is separated by ';' and '@&@', how can I work with it?
When I used MapReduce, I split the data by ";" in the map phase and then again by "@&@" in the reduce phase.
Also, suppose we have a CSV file whose first field is a username in "FirstnameLastname" format:
raw = LOAD 'log.csv' USING PigStorage(',') AS (username: chararray, site: chararray, views: int);
With the example above we only get the whole username. How can I get the first name and the last name as separate fields?
Upvotes: 2
Views: 4575
Reputation: 21
Maybe you can use STRSPLIT to split the string a second time.
Also, ';' can be written as \\u003B when you use it as the delimiter.
Upvotes: 2
Reputation: 39893
You can do just about anything Java or Python can do with UDFs in Pig. Pig is not intended to have an exhaustive set of processing functions; it just provides basic functionality. Piggybank fills the gap with a collection of community-contributed UDFs. Sometimes Piggybank just doesn't have what you need, so it's a good thing UDFs are pretty simple to write.
You could write a custom loader that handles the unique structure of your data at load time. The custom load function manipulates the data with Java code and outputs the structured, columnar format that Pig expects. Another nice thing about custom loaders is that they can specify the load schema, so you don't have to write out the AS (...) clause.
A = LOAD 'log.csv' USING MyCustomLoader('whatever', 'parameters');
You could write a custom evaluation function. Sometimes a function like STRSPLIT or TOKENIZE just isn't good enough. Use TextLoader to read your data in line by line, and then follow up with a UDF that parses each line and outputs a tuple (which can then be flattened into columns).
A = LOAD 'log.csv' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(CustomLineParser(line));
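For illustration, here is a rough sketch of what such a line parser could look like as a Python (Jython) UDF. Everything in it is an assumption rather than something from the question: the function names parse_line and split_name, the file name parsers.py, the schema field names, and the line layout "id;a@&@b". It also assumes Pig makes the outputSchema decorator available when the script is registered with USING jython; the fallback at the top only exists so the file also runs outside Pig.

```python
import re

# Assumption: Pig provides `outputSchema` when this file is registered with
# "USING jython". The no-op fallback lets the file run as plain Python too.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("id:chararray, a:chararray, b:chararray")
def parse_line(line):
    """Parse a hypothetical 'id;a@&@b' line into a 3-field tuple."""
    if line is None:
        return None
    # First split on ';', then split the remainder on '@&@'.
    head, _, rest = line.partition(';')
    a, _, b = rest.partition('@&@')
    return (head, a, b)


@outputSchema("first:chararray, last:chararray")
def split_name(username):
    """Split 'FirstnameLastname' at the second capitalised word."""
    if username is None:
        return None
    match = re.match(r'^([A-Z][a-z]+)([A-Z][A-Za-z]*)$', username)
    if match is None:
        # No recognisable boundary: keep the whole value as the first name.
        return (username, None)
    return match.groups()

# Hypothetical use from a Pig script (file and alias names are made up):
#   REGISTER 'parsers.py' USING jython AS parsers;
#   B = FOREACH A GENERATE FLATTEN(parsers.parse_line(line));
```

The same split_name idea would cover the username case from the question, as long as the names really do follow a strict "FirstnameLastname" capitalisation pattern.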
Upvotes: 4