Mitesh Patel

Reputation: 1

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, the tweet column itself contains commas.

For example, one row of data:

396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143

I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:

a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);

b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);

The error is:

error: Filter's condition must evaluate to boolean.

Upvotes: 0

Views: 396

Answers (2)

Murali Rao

Reputation: 2287

In this use case, reading the data with PigStorage(',') drops savava143 (the last field value):

A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;

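Output : A. Observe that the last field value is missing: the tweet is split at its embedded comma across f2 and f3, so savava143 falls outside the three declared fields.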

(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")

To extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader from Piggybank.

Approach 1 : Using CSVExcelStorage

Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html

Input : a.csv

396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143

Pig Script :

REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3); 
DUMP A;

Output : A

(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)

Approach 2 : Using CSVLoader

Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html

The script below uses CSVLoader(); DUMP A produces the same output as in Approach 1.

REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
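
Once the loader handles the quoting, each column can be referenced on its own. A minimal sketch, assuming piggybank.jar is available on the path used by REGISTER; the field names and the chararray types are illustrative choices, not part of the original question:

```pig
REGISTER piggybank.jar;

-- Field names and types below are assumptions made for illustration.
tweets = LOAD 'a.csv'
         USING org.apache.pig.piggybank.storage.CSVLoader()
         AS (tweetid:chararray, tweet:chararray, userid:chararray);

-- The quoted tweet arrives as one field, so the other columns line up.
ids_and_users = FOREACH tweets GENERATE tweetid, userid;
DUMP ids_and_users;
```

Declaring tweetid as chararray avoids any numeric conversion; it could be cast later if arithmetic on the id is ever needed.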

Upvotes: 2

glefait

Reputation: 1691

The error arises because you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. FILTER has to decide whether each line is kept or dropped, hence the boolean requirement.

Therefore, you have to use :

b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, INDEX_OF_GROUP_TO_RETURN);

However, as @Murali Rao said, your values are not just comma-separated but CSV (think about how you would handle a comma inside a tweet: it is not a field separator, just content).
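
To make the FILTER versus FOREACH distinction concrete, here is a rough sketch that reads each line whole and pulls the three fields out with a regex. The file name 'tweets.csv' and the alias names are placeholders, and the pattern assumes every tweet is wrapped in double quotes with no quotes or commas in the first and last fields; a real CSV loader is more robust than this.

```pig
-- Read each line as a single chararray ('tweets.csv' is a placeholder path).
raw = LOAD 'tweets.csv' USING TextLoader() AS (line:chararray);

-- FILTER needs a boolean: MATCHES tests the whole line against the pattern.
quoted = FILTER raw BY line MATCHES '[^,]*,".*",[^,]*';

-- FOREACH ... GENERATE builds new fields; the third argument of
-- REGEX_EXTRACT is the index of the capture group to return.
b = FOREACH quoted GENERATE
      REGEX_EXTRACT(line, '([^,]*),"(.*)",([^,]*)', 1) AS tweetid,
      REGEX_EXTRACT(line, '([^,]*),"(.*)",([^,]*)', 2) AS tweet,
      REGEX_EXTRACT(line, '([^,]*),"(.*)",([^,]*)', 3) AS userid;
DUMP b;
```

The greedy (.*) in the middle group absorbs any commas inside the quoted tweet, which is why the line is not split on ',' before the regex is applied.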

Upvotes: 0
