Reputation: 307
I am trying to generate following ... Input 396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09 396124440112951296,"00:00 #MAW",WesleyBitton
A = LOAD '/user/root/data/tweets.csv' USING PigStorage(',') as (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
output truncated (396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift)
Output excepting (396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse")
I do not want to read row as line.
Upvotes: 0
Views: 170
Reputation: 1445
You can use CSVLoader for loading data
however if you do not wish to do that here is the work around in Apache Pig itself for that :
--Load your Data
A = LOAD 'your/path/users.csv' USING TextLoader() AS (unparsed:chararray);
--Replace your "
string with |
so as to separate your tweets
B = FOREACH A GENERATE REPLACE(unparsed, '\\"', '|') AS parsed:chararray;
--store your temporary parsed data into your location
STORE B INTO 'your/path/parsed_users.csv' USING PigStorage('|');
--load your parsed data
C = LOAD 'your/path/parsed_users.csv' USING PigStorage('|') AS (users:chararray, tweets:chararray);
--Dump your data , how ever this will still contain one extra comma(,
) but you can replace it by using the replace function you get the point.
DUMP C;
Upvotes: 1
Reputation: 3973
Thats fit in the csv standardization, so you need just to use CSVLoader which
supports double-quoted fields that contain commas and other double-quotes escaped with backslashes.
This is how to use it :
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/user/root/data/tweets.csv' USING CSVLoader AS (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
Upvotes: 0