kobame

Reputation: 5856

Changing CSV file with regexes

I wrote an answer to a question that was closed in the meantime, so I am rewording and re-asking it here.

I have a CSV file with 180 million records and 5 columns, like this:

"c a","L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)","C & P_L",1,0

How can I change it to a 3-column structure like this:

"c a|L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)|C & P_L",1,0

That is, I need to concatenate columns 1, 2 and 3 with | into a single column and leave the other columns unchanged.

I tried it with regexes:

cat RelatedKW.csv | perl -pe 's/(\|)/\//g'| perl -pe 's/("\s*"|"\s*"\s*\\n$)//g'| perl -pe 's/^,"|,,|"\s*,\s*\"/|/g' | perl -pe 's/\"(\d+),(\d+)\"/ |$1|$2/g' > newRKW4.csv

Is there a better way?

Upvotes: 1

Views: 85

Answers (2)

sam

Reputation: 1290

Assuming your data looks exactly like the sample, this should work:

$line =~ s-\",\"-|-g;

Upvotes: 0

kobame

Reputation: 5856

You should generally avoid parsing CSV with regexes, as Kent Fredric explains in an answer to a similar question:

Not using CPAN is really a recipe for disaster.

Please consider this before trying to write your own CSV implementation. Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.

Parsing CSV with regexes is bad practice because, for example, you need to handle:

  • escaped quotes
  • escaped separator characters
  • fields containing the delimiter

and so on, all of which Text::CSV will handle for you.
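
To make the failure mode concrete, here is a minimal sketch comparing a naive split on commas with Text::CSV_XS. The sample line is made up for illustration (it is not from the question's data); its second field contains both a comma and an escaped quote.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample line: field 2 holds a comma and an escaped ("") quote.
my $line = q{"a","say ""hi"", then leave","c",1,0};

# Naive approach: split on every comma, ignoring the quoting rules.
my @naive = split /,/, $line;

# Real parser: binary => 1 allows arbitrary bytes inside fields.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag;
$csv->parse($line) or die "Parse failed: " . $csv->error_diag;
my @fields = $csv->fields;

printf "naive split:  %d fields\n", scalar @naive;
printf "Text::CSV_XS: %d fields\n", scalar @fields;
print  "second field: $fields[1]\n";
```

The naive split cuts the quoted field in two and reports six fields, while the parser returns the expected five with the quotes unescaped.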

Here's a solution that uses Text::CSV_XS, the XS backend of Text::CSV. I'm not a Perl expert, so the following code may be missing some things, but it is probably better than using regexes:

perl -MText::CSV_XS -E '$csv = Text::CSV_XS->new ({ eol => $/ }); $csv->print(*STDOUT, [join(q{|}, @$row[0..2]), @$row[3..4]]) while ($row = $csv->getline(*STDIN))' < csv

Input:

"c a","L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)","C & P_L",1,0

Output:

"c a|L G-3 (8) N (4th G P Q C- 4 R- 1 T H- 15.6 I- W 8.1) (B)|C & P_L",1,0

Some potential problems: it doesn't handle escaping of any | characters already present in the input, it has no error handling, and so on. For a more robust solution, write a full Perl script rather than a one-liner.
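
As a rough sketch of what such a script might look like (not a definitive implementation): it reads CSV on STDIN, writes to STDOUT, skips rows that are too short, and stops on parse errors. The helper name merge_first_three and the choice to escape an existing | as \| are my own assumptions; pick whatever escaping your consumer expects.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;

# Join the first three fields with "|" and keep the remaining fields as-is.
# Assumption: a literal "|" inside a field is escaped as "\|" (hypothetical convention).
sub merge_first_three {
    my ($row) = @_;
    my @head = map { (my $f = $_ // '') =~ s/\|/\\|/g; $f } @{$row}[0 .. 2];
    return [ join('|', @head), @{$row}[3 .. $#{$row}] ];
}

sub main {
    # binary => 1 accepts any bytes in fields; auto_diag => 1 reports parse errors.
    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1, eol => "\n" })
        or die "Cannot create parser: " . Text::CSV_XS->error_diag;

    my $n = 0;    # our own record counter, used only in messages
    while (my $row = $csv->getline(\*STDIN)) {
        $n++;
        if (@$row < 3) {
            warn "record $n: fewer than 3 fields, skipped\n";
            next;
        }
        $csv->print(\*STDOUT, merge_first_three($row))
            or die "Write failed: " . $csv->error_diag;
    }
    # getline returns undef both at EOF and on error; tell the two apart.
    $csv->eof or die "Stopped after record $n because of a parse error\n";
}

main() unless caller;
```

Saved as, say, merge.pl (a made-up name), it would be run as perl merge.pl < RelatedKW.csv > newRKW4.csv. Since it processes one record at a time, memory use should stay flat even for 180 million rows.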

Upvotes: 1
