AfterWorkGuinness

Reputation: 361

Perl regular expression help (parse out column)

I'm stuck here and not sure why my regex won't work. I have a pipe-delimited text file with a series of columns, and I need to extract the 3rd column.

File:

A|B|C|D|E|F|G|H|I
2011-03-03 00:00:00.0|1|60510271|254735|27751|BBB|1|-0.1619023623|-0.009865904
2011-03-03 00:00:00.0|1|60510270|254735|27751|B|3|-0.0064786612|-0.0063739185
2011-03-03 00:00:00.0|1|60510269|254735|27751|B|3|-0.0084998226|-0.009244384

Regular expression:

$> head foo | perl -pi -e 's/^(.*)\|(.*)\|(.*)\|(.*)$/$3/g'

Output

-0.1619023623
-0.0064786612
-0.0084998226

Clearly this is not the correct column (it is the 8th, not the 3rd).

Any thoughts?

Upvotes: 1

Views: 3030

Answers (6)

noob

Reputation: 1

(?<=\|)\d{8}

Maybe this would work: (?<=\|) is a positive lookbehind for a | character, and \d{8} then matches the 8 digits that follow it.

Upvotes: 0

Berserk

Reputation: 11

First thought was Text::CSV (mentioned by Matt B), but if the data looks like the example I'd say split is the right choice.

Untested:

$> head foo | perl -le 'while (<>) { print((split /\|/)[2]); }'

If you really want a regex I would use something like this:

s{^ [^\|]* \| [^\|]* \| ([^\|]*) \| .*$}{$1}gx;

Upvotes: 1

kurumi

Reputation: 25599

Normally it's easier and simpler (KISS) NOT to use a regex for file formats that have structured delimiters. Just split the string on the "|" delimiter and get the 3rd field.

awk -F"|" '{print $3}' file

With Ruby(1.9+)

ruby -F"\|" -ane 'puts $F[2]' file

With Perl, it's similar to the above Ruby one-liner.

perl -F"\|" -ane 'print $F[2]."\n"' file

Upvotes: 4

Billy Moon

Reputation: 58521

You need to make your pattern non-greedy, by using .*? instead of .* for the leading groups, so:

's/^(.*?)\|(.*?)\|(.*?)\|(.*)$/$3/g'

Upvotes: 1

Matt Ball

Reputation: 359776

How about using a real parser instead of hacking together a regex? Text::CSV should do the job.

use Text::CSV;
my $csv = Text::CSV->new({sep_char => "|"});

Upvotes: 1

Gareth McCaughan

Reputation: 19971

.* will by default match as much as it can, so your RE is picking out the last three columns (and everything before) rather than the first three (and everything after). You can avoid this in (at least) two ways: (1) instead of .*, look for [^|]*, or (2) make your repetition operators non-greedy: .*? instead of .*.

(Or you could explicitly split the string instead of matching the whole thing with a single RE. You might want to try both approaches and see which performs better, if it matters. Splitting is likely to give longer but clearer code.)

Upvotes: 1
