dipayan
dipayan

Reputation: 72

Removal Double Quote(") from CSV file using PIG

I am trying to remove double quotes(") from file.Some of the field has data like "Newyork,NY". Please advice me what to do?I have tried to delete (") from CSV.But it is not happening.Stepwise Codes are given below:

I am opening pig using pig -x local

1st Step:

test4 = LOAD '/home/hduser/Desktop/flight_data.csv'
        using PigStorage(',') AS (
YEAR: chararray,
QUARTER: chararray,
MONTH: chararray,
DAY_OF_MONTH: chararray,
DAY_OF_WEEK: chararray,
FL_DATE: chararray,
UNIQUE_CARRIER: chararray,
AIRLINE_ID: chararray,
CARRIER: chararray,
TAIL_NUM: chararray,
FL_NUM: chararray,
ORIGIN: chararray,
ORIGIN_CITY_NAME: chararray,
ORIGIN_STATE_ABR: chararray,
ORIGIN_STATE_FIPS: chararray,
ORIGIN_STATE_NM: chararray,
ORIGIN_WAC: chararray,
DEST: chararray,
DEST_CITY_NAME: chararray,
DEST_STATE_ABR: chararray,
DEST_STATE_FIPS: chararray,
DEST_STATE_NM: chararray,
DEST_WAC: chararray,
CRS_DEP_TIME: chararray,
DEP_TIME: chararray,
DEP_DELAY: chararray,
DEP_DELAY_NEW: chararray,
DEP_DEL15: chararray,
DEP_DELAY_GROUP: chararray,
DEP_TIME_BLK: chararray,
TAXI_OUT: chararray,
WHEELS_OFF: chararray,
WHEELS_ON: chararray,
TAXI_IN: chararray,
CRS_ARR_TIME: chararray,
ARR_TIME: chararray,
ARR_DELAY: chararray,
ARR_DELAY_NEW: chararray,
ARR_DEL15: chararray,
ARR_DELAY_GROUP: chararray,
ARR_TIME_BLK: chararray,
CANCELLED: chararray,
CANCELLATION_CODE: chararray,
DIVERTED: chararray,
CRS_ELAPSED_TIME: chararray,
ACTUAL_ELAPSED_TIME: chararray,
AIR_TIME: chararray,
FLIGHTS: chararray,
DISTANCE: chararray,
DISTANCE_GROUP: chararray,
CARRIER_DELAY: chararray,
WEATHER_DELAY: chararray,
NAS_DELAY: chararray,
SECURITY_DELAY: chararray,
LATE_AIRCRAFT_DELAY: chararray); 

2nd Step:

new_data = foreach test4 generate
FLATTEN(REGEX_EXTRACT(ORIGIN_CITY_NAME,'."([^"])"',1)) AS StateName;

After I writing this command,in new_data variable fields are saving like (). Please suggest me some option to overcome this problem.Thanks in advance for your help.

I have tried with another way also,given below:

aviation_data = foreach test4 generate
REGEX_EXTRACT($0,'([0-9]+)', 1),
REGEX_EXTRACT($1,'([0-9]+)', 1),
REGEX_EXTRACT($2,'([0-9]+)', 1),
REGEX_EXTRACT($3,'([0-9]+)', 1),
REGEX_EXTRACT($4,'([0-9]+)', 1),
REGEX_EXTRACT($5,'([0-9]+)', 1),
REGEX_EXTRACT($6,'([0-9]+)', 1),
REGEX_EXTRACT($7,'([0-9]+)', 1),
REGEX_EXTRACT($8,'([0-9]+)', 1),
REGEX_EXTRACT($9,'([0-9]+)', 1),
REGEX_EXTRACT($10,'([0-9]+)', 1),
REGEX_EXTRACT($11,'([0-9]+)', 1),
REGEX_EXTRACT($12,'([0-9]+)', 1),
REGEX_EXTRACT($13,'([0-9]+)', 1),
REGEX_EXTRACT($14,'([0-9]+)', 1),
REGEX_EXTRACT($15,'([0-9]+)', 1),
REGEX_EXTRACT($16,'([0-9]+)', 1),
REGEX_EXTRACT($17,'([0-9]+)', 1),
REGEX_EXTRACT($18,'([0-9]+)', 1),
REGEX_EXTRACT($19,'([0-9]+)', 1),
REGEX_EXTRACT($20,'([0-9]+)', 1),
REGEX_EXTRACT($21,'([0-9]+)', 1),
REGEX_EXTRACT($22,'([0-9]+)', 1),
REGEX_EXTRACT($23,'([0-9]+)', 1),
REGEX_EXTRACT($24,'([0-9]+)', 1),
REGEX_EXTRACT($25,'([0-9]+)', 1),
REGEX_EXTRACT($26,'([0-9]+)', 1),
REGEX_EXTRACT($27,'([0-9]+)', 1),
REGEX_EXTRACT($28,'([0-9]+)', 1),
REGEX_EXTRACT($29,'([0-9]+)', 1),
REGEX_EXTRACT($30,'([0-9]+)', 1),
REGEX_EXTRACT($31,'([0-9]+)', 1),
REGEX_EXTRACT($32,'([0-9]+)', 1),
REGEX_EXTRACT($33,'([0-9]+)', 1),
REGEX_EXTRACT($34,'([0-9]+)', 1),
REGEX_EXTRACT($35,'([0-9]+)', 1),
REGEX_EXTRACT($36,'([0-9]+)', 1),
REGEX_EXTRACT($37,'([0-9]+)', 1),
REGEX_EXTRACT($38,'([0-9]+)', 1),
REGEX_EXTRACT($39,'([0-9]+)', 1),
REGEX_EXTRACT($40,'([0-9]+)', 1),
REGEX_EXTRACT($41,'([0-9]+)', 1),
REGEX_EXTRACT($42,'([0-9]+)', 1),
REGEX_EXTRACT($43,'([0-9]+)', 1),
REGEX_EXTRACT($44,'([0-9]+)', 1),
REGEX_EXTRACT($45,'([0-9]+)', 1),
REGEX_EXTRACT($46,'([0-9]+)', 1),
REGEX_EXTRACT($47,'([0-9]+)', 1),
REGEX_EXTRACT($48,'([0-9]+)', 1),
REGEX_EXTRACT($49,'([0-9]+)', 1),
REGEX_EXTRACT($50,'([0-9]+)', 1),
REGEX_EXTRACT($51,'([0-9]+)', 1),
REGEX_EXTRACT($52,'([0-9]+)', 1),
REGEX_EXTRACT($53,'([0-9]+)', 1),
REGEX_EXTRACT($54,'([0-9]+)', 1);

results are given below:

(2015,1,1,29,4,2015,,20304,,549,4837,,,,,04,,81,,,,,53,,93,1757,1851,54,54,1,3,1700,19,1910,2034,6,2005,2040,35,35,1,2,2000,0,,0,188,169,144,1,1107,5,0,0,0)

none of the text field is coming.

Upvotes: 1

Views: 1206

Answers (1)

Murali Rao
Murali Rao

Reputation: 2287

We can use either : org.apache.pig.piggybank.storage.CSVExcelStorage() or org.apache.pig.piggybank.storage.CSVLoader().

Refer the below API links for details

http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html

test4 = LOAD '/home/hduser/Desktop/flight_data.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (....)

Upvotes: 1

Related Questions