pavel_kazlou
pavel_kazlou

Reputation: 2046

extract optional field values from lines

I have text in the form of separate lines, where each line has CSV-like format:

SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656

The order of fields is always the same, but some fields may be absent. There can be other fields between fields of interest, for example comparing to the line above I can get the following as well:

SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS

As the result of processing this text I want to have clean CSV file with my fields specified one after another:

12,0.2321,12:10:08 2011/07/22,656

If some field is absent then I would like to simple omit value (for example FIELD_B was absent):

12,,12:10:08 2011/07/22,656

How can I do this using commands like sed, perl or awk ? I tried extracting single field with perl -pe 's/^.*?(FIELD_A: (.*?),)?.*?$/\2/' and failed - regex simply ignores my field even if it presents

Upvotes: 3

Views: 252

Answers (2)

Toto
Toto

Reputation: 91385

How about this way (assuming fileds names are known) :

#!/usr/bin/perl
use strict;
use warnings;

my @f = qw(FIELD_A FIELD_B FIELD_C FIELD_D);
while(my $line = <DATA>) {
    chomp $line;
    my @r;
    for(@f) {
        if ($line =~ /$_:\s*([^,]+)/) {
            push @r, $1;
        } else {
            push @r,'';
        }
    }
    print join(',',@r), "\n";
}
__DATA__
SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS

output:

12,0.2321,12:10:08 2011/07/22,656
12,0.2321,12:10:08 2011/07/22,656
12,,12:10:08 2011/07/22,656

Upvotes: 1

dogbane
dogbane

Reputation: 274542

You can use awk with an associative array as shown below. Loop over the fields and split them on :. Then store the key-value pair into an associative array. Finally print out the fields you want.

awk -F, '{
 split("",arr)
 for(i=1; i<=NF; i++){
   a=index($i, ":")
   if(a != 0){
     # split on first colon to get key-value pair
     key=substr($i,1,a-1)
     val=substr($i,a+1)

     # remove leading spaces from key and value
     gsub(/^ */,"",key)
     gsub(/^ */,"",val)

     # store in an associative array
     arr[key]=val
   }   
 }
 # print out the desired fields
 print arr["FIELD_A"]","arr["FIELD_B"]","arr["FIELD_C"]","arr["FIELD_D"]
}' data.txt

Upvotes: 2

Related Questions