N. J
N. J

Reputation: 480

Using AWK to process CSV file with rows containing an array

Trying to process a CSV file using AWK, however I have met a problem that many of my cells in my row already contain comma ,, meaning I can not separate field using awk -F,.

CSV FILE

Name,...DATE,COLUMNX,ADDRESSES
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"

The ... represents rows/columns containing [, ,, ', "

What I have tried:
awk -F, '{print $X}'
This give me following output:

'Element2']
"['192.168.x.101'

What I want to accomplish:

host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

Upvotes: 1

Views: 206

Answers (4)

Ed Morton
Ed Morton

Reputation: 203532

Using GNU awk for FPAT:

$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    n = split($NF,a,/\047/)
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

To see how FPAT is splitting the input into comma-separated fields and then split() is splitting the last field into '-separated subfields, add some prints, e.g.:

$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    print "============="
    print
    for ( i=1; i<=NF; i++ ) {
        print i, "<" $i ">"
    }
    n = split($NF,a,/\047/)
    for ( i=1; i<=n; i++ ) {
        print "\t" NF "." i, "<" a[i] ">"
    }
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
=============
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
1 <host1>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element1', 'Element2']>
6 <"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]">
        6.1 <"[>
        6.2 <192.168.x.99>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <,>
        6.6 <192.168.x.100>
        6.7 <, fe80:XX]">
host1 192.168.x.99
host1 192.168.x.100
=============
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
1 <host2>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element3']>
6 <"['192.168.x.101', 'fe80:XX']">
        6.1 <"[>
        6.2 <192.168.x.101>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <]">
host2 192.168.x.101

See What's the most robust way to efficiently parse CSV using awk? for more information on FPAT and parsing CSVs with awk.

Upvotes: 0

Dave Pritlove
Dave Pritlove

Reputation: 2687

Modern versions of awk allow a record to split at more than one field separator. Thus, each line can be split at both commas and single quote marks to isolate the data you need.

To use ' as a field separator along with , requires the former to be escaped and it can be quite tricky to then combine the two. The simplest way I've found after a few trials is to use the shell F switch with a regular expression including the escaped ' and ,. It's messy as you have to close the first single quote before escaping the required one and re-opening a single-quotted command: -F'[,'\''=]' (I generally prefer setting field separators within the awk procedure but this one defeated me).

This edited version works to isolate the field (change $35 to suit by trail-and error):

awk -F'[,'\'']' 'NR>1{print $1" "$35}' data.csv

I tested the above on the following test file:

data.csv:
Name,...DATE,COLUMNX,ADDRESSES
host1,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.xxx.yy', 'fe80:XX','192.168.x.100', fe80:XX]"
host3,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.xxx.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host4,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['xxx.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"

output:

host1 192.168.x.99
host2 192.168.xxx.yy
host3 192.xxx.x.99
host4 xxx.168.x.99

Upvotes: 0

j_b
j_b

Reputation: 2020

Using awk:

awk -F",\"|\"$" 'NR>1 { \
gsub(/\047|[\[\]]/,""); \
split($2,a,", "); \
split($1,h,","); \
for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]}}' src.csv

Output:

host1 192.168.x.100
host1 192.168.x.99
host2 192.168.x.101

Details:

-F",\"|\"$" (split on ," or " at end of record (will remove trailing double quote and each record will be split into two fields.

gsub(/\047|[\[\]]/,""); (sanitize by removing both single quotes and/or brackets)

split($2,a,", "); (split second field into array a on comma-space)

split($1,h,","); (split first field into array h on comma.

for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n] Iterate over array a and only print output if array item starts with a number

Upvotes: 0

Andre Wildberg
Andre Wildberg

Reputation: 19088

I'd recommend a proper CSV parser to do the job, then use awk to do the regex, e.g.

$ ruby -r 'csv' -ne 'lines=$_
  CSV.parse(lines) do |i| 
    i.each do |j| 
      printf("%s ", j)
    end
    puts ""
  end' file | 
awk '{gsub(/\[\047|\047\]|\047|\]|,/, "", $0)}
  /^host/{for(i=1;i<=NF;i++){if($i~/^[0-9]+\.+/){print $1, $i}}}'
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

Upvotes: 1

Related Questions