Reputation: 480
Trying to process a CSV file using AWK; however, I have hit a problem: many of the cells in my rows already contain commas, meaning I cannot separate fields using awk -F, alone.
CSV FILE
Name,...DATE,COLUMNX,ADDRESSES
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
The ... represents rows/columns that contain the characters [ , ' and " as well as commas.
What I have tried:
awk -F, '{print $X}'
This gives me the following output:
'Element2']
"['192.168.x.101'
What I want to accomplish:
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
Upvotes: 1
Views: 206
Reputation: 203532
Using GNU awk for FPAT:
$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    n = split($NF,a,/\047/)
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}
$ awk -f tst.awk file
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
To see how FPAT is splitting the input into comma-separated fields, and then how split() is splitting the last field into '-separated subfields, add some prints, e.g.:
$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    print "============="
    print
    for ( i=1; i<=NF; i++ ) {
        print i, "<" $i ">"
    }
    n = split($NF,a,/\047/)
    for ( i=1; i<=n; i++ ) {
        print "\t" NF "." i, "<" a[i] ">"
    }
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}
$ awk -f tst.awk file
=============
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
1 <host1>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element1', 'Element2']>
6 <"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]">
6.1 <"[>
6.2 <192.168.x.99>
6.3 <, >
6.4 <fe80:XX>
6.5 <,>
6.6 <192.168.x.100>
6.7 <, fe80:XX]">
host1 192.168.x.99
host1 192.168.x.100
=============
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
1 <host2>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element3']>
6 <"['192.168.x.101', 'fe80:XX']">
6.1 <"[>
6.2 <192.168.x.101>
6.3 <, >
6.4 <fe80:XX>
6.5 <]">
host2 192.168.x.101
See What's the most robust way to efficiently parse CSV using awk? for more information on FPAT and parsing CSVs with awk.
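The stride of 4 in the split() loop is worth unpacking: splitting on ' alternates between separator runs and quoted values, so the quoted IP addresses land at a[2], a[6], a[10], and so on. A minimal sketch of that step alone, run on just the last field with plain POSIX awk (no FPAT needed, sample data from the question):

```shell
printf '%s\n' "['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]" |
awk '{
    # split on single quotes (\047): quoted values land at a[2], a[6], ...
    n = split($0, a, /\047/)
    for (i = 2; i <= n; i += 4) print a[i]
}'
# prints:
# 192.168.x.99
# 192.168.x.100
```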
Upvotes: 0
Reputation: 2687
Modern versions of awk allow a record to split at more than one field separator. Thus, each line can be split at both commas and single quote marks to isolate the data you need.
To use ' as a field separator along with , requires the former to be escaped, and it can be quite tricky to combine the two. The simplest way I've found after a few trials is to use the -F switch with a regular expression that includes the escaped ' and the ,. It's messy, as you have to close the first single quote, escape the required one, and then re-open the single-quoted expression: -F'[,'\'']'
(I generally prefer setting field separators within the awk procedure, but this one defeated me.)
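That shell quoting dance can be seen in isolation: to embed a literal single quote inside a single-quoted argument you close the quote, emit a backslash-escaped quote, and reopen. A minimal sketch with illustrative data (not the question's file):

```shell
# '\'' = close quote, escaped literal ', reopen quote;
# the resulting FS regex is [,'] so fields split on commas and single quotes
printf '%s\n' "host1,'192.168.x.99'" |
awk -F'[,'\'']' '{ print $1, $3 }'
# prints: host1 192.168.x.99
```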
This edited version works to isolate the field (change $35 to suit by trial and error):
awk -F'[,'\'']' 'NR>1{print $1" "$35}' data.csv
I tested the above on the following test file:
data.csv:
Name,...DATE,COLUMNX,ADDRESSES
host1,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.xxx.yy', 'fe80:XX','192.168.x.100', fe80:XX]"
host3,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.xxx.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host4,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['xxx.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
output:
host1 192.168.x.99
host2 192.168.xxx.yy
host3 192.xxx.x.99
host4 xxx.168.x.99
Upvotes: 0
Reputation: 2020
Using awk:
awk -F",\"|\"$" 'NR>1 {
    gsub(/\047|[\[\]]/,"");
    split($2,a,", ");
    split($1,h,",");
    for (n in a) { if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n] }
}' src.csv
Output:
host1 192.168.x.100
host1 192.168.x.99
host2 192.168.x.101
Details:
-F",\"|\"$"
(split on ," or on " at the end of the record; this strips the trailing double quote, so each record is split into two fields)
gsub(/\047|[\[\]]/,"");
(sanitize by removing single quotes and square brackets)
split($2,a,", ");
(split the second field into array a on comma-space)
split($1,h,",");
(split the first field into array h on comma)
for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]}
(iterate over array a and only print output if the array item starts with a number)
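The sanitize-then-split stages above can be exercised on a single record to check each step (a sketch using one sample line from the question):

```shell
printf '%s\n' "host2,...,NOV 24, 2022,['Element3'],\"['192.168.x.101', 'fe80:XX']\"" |
awk -F",\"|\"$" '{
    gsub(/\047|[\[\]]/,"")   # strip single quotes and brackets; $0 re-splits on FS
    split($2,a,", ")         # a[] holds the address-list items
    split($1,h,",")          # h[1] is the hostname
    for (n in a) if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]
}'
# prints: host2 192.168.x.101
```

Note that gsub() modifies $0, which causes awk to re-split the record on FS; only the double quotes survive the sanitizing, so the ," separator still falls in the right place.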
Upvotes: 0
Reputation: 19088
I'd recommend a proper CSV parser to do the job, then use awk to do the regex, e.g.
$ ruby -r 'csv' -ne 'lines=$_
CSV.parse(lines) do |i|
i.each do |j|
printf("%s ", j)
end
puts ""
end' file |
awk '{gsub(/\[\047|\047\]|\047|\]|,/, "", $0)}
/^host/{for(i=1;i<=NF;i++){if($i~/^[0-9]+\.+/){print $1, $i}}}'
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
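The awk stage of that pipeline can be tested on its own by feeding it a line like the one the ruby stage would produce, i.e. the CSV fields already joined with spaces (a sketch using the question's sample data):

```shell
# simulate the ruby stage's output: CSV fields printed space-separated
printf '%s\n' "host2 ... NOV 24  2022 ['Element3'] ['192.168.x.101', 'fe80:XX'] " |
awk '{gsub(/\[\047|\047\]|\047|\]|,/, "", $0)}
     /^host/{for(i=1;i<=NF;i++){if($i~/^[0-9]+\.+/){print $1, $i}}}'
# prints: host2 192.168.x.101
```

The /^[0-9]+\.+/ test requires digits followed by a dot, which is what keeps the bare year field 2022 out of the output.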
Upvotes: 1