Reputation: 2046
I have text in the form of separate lines, where each line has CSV-like format:
SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656
The order of fields is always the same, but some fields may be absent. There can be other fields between fields of interest, for example comparing to the line above I can get the following as well:
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS
As the result of processing this text I want to have clean CSV file with my fields specified one after another:
12,0.2321,12:10:08 2011/07/22,656
If some field is absent then I would like to simple omit value (for example FIELD_B was absent):
12,,12:10:08 2011/07/22,656
How can I do this using commands like sed, perl or awk ?
I tried extracting single field with perl -pe 's/^.*?(FIELD_A: (.*?),)?.*?$/\2/'
and failed - regex simply ignores my field even if it presents
Upvotes: 3
Views: 252
Reputation: 91385
How about this way (assuming fileds names are known) :
#!/usr/bin/perl
use strict;
use warnings;
my @f = qw(FIELD_A FIELD_B FIELD_C FIELD_D);
while(my $line = <DATA>) {
chomp $line;
my @r;
for(@f) {
if ($line =~ /$_:\s*([^,]+)/) {
push @r, $1;
} else {
push @r,'';
}
}
print join(',',@r), "\n";
}
__DATA__
SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS
output:
12,0.2321,12:10:08 2011/07/22,656
12,0.2321,12:10:08 2011/07/22,656
12,,12:10:08 2011/07/22,656
Upvotes: 1
Reputation: 274542
You can use awk
with an associative array as shown below. Loop over the fields and split them on :
. Then store the key-value pair into an associative array. Finally print out the fields you want.
awk -F, '{
split("",arr)
for(i=1; i<=NF; i++){
a=index($i, ":")
if(a != 0){
# split on first colon to get key-value pair
key=substr($i,1,a-1)
val=substr($i,a+1)
# remove leading spaces from key and value
gsub(/^ */,"",key)
gsub(/^ */,"",val)
# store in an associative array
arr[key]=val
}
}
# print out the desired fields
print arr["FIELD_A"]","arr["FIELD_B"]","arr["FIELD_C"]","arr["FIELD_D"]
}' data.txt
Upvotes: 2