Reputation: 1182
I have a csv in the format name,id,logindate,
where login dates appear as "July 15, YYYY HH:mm:ss",
i.e. abc,123,"July 15, YYYY HH:mm:ss"
Please note that there are headers and other information that should be skipped in the first 5 lines, so a sample csv file may look like:
AuditReport
asdf
qwerty
asdf
name, id, logindate
experiment,182002, "July 31, 2022 20:00:00"
unit 1998,183065, "July 3, 2022 21:00:00"
asdf, 202065, "May 25, 2022 20:00:00"
For my output, I would like to get the following (headers are removed):
experiment,182002, "July 31 2022 20:00:00"
unit 1998,183065, "July 3 2022 21:00:00"
asdf, 202065, "May 25 2022 20:00:00"
My main task is to parse the commas properly even when one is included inside a quoted string.
After much Google searching and going through several SO questions, I have come to the conclusion that a csv parser, some other language, or even GNU awk (using FPAT) would be a better tool for this, but I'm told that the production server in the company my dad works at uses awk and should be assumed not to be gawk. (I'm doing random small odd tasks to prepare myself for finding a job.)
I'm trying to work around this by removing the "" and splitting on FS=",", then concatenating the last two columns together again. However, my output keeps giving me 4 columns (I'm unable to concatenate the last two columns back into one column).
my code is:
/usr/bin/env awk 'BEGIN{FS=","} NR>5 {print}' sample.csv | awk '{ gsub("\"", "") } { $1=$1 } 1' | awk '{ print $1, $2, $3" "$4 }' > test.csv
I also tried the following:
https://stackoverflow.com/a/48386788/16034206
awk '{$2=$2"-"$3;$3=""} 1' Input_file
In my case:
/usr/bin/env awk 'BEGIN{FS=","} NR>5 {print}' sample.csv | awk '{ gsub("\"", "") } { $1=$1 } 1' | awk '{ $3=$3" "$4; $4="" } 1' > test.csv
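For reference, the re-concatenation idea does work in a single awk process. Here is a minimal sketch (sample data inlined here instead of sample.csv; quotes are stripped, as in the pipeline above):

```shell
# Skip the first 5 lines, strip the quotes (modifying $0 re-splits the
# fields), then join the two halves of the split date ($3 and $4) back
# together -- all in one awk process, any POSIX awk.
printf '%s\n' \
  'AuditReport' 'asdf' 'qwerty' 'asdf' 'name, id, logindate' \
  'experiment,182002, "July 31, 2022 20:00:00"' |
awk 'BEGIN{FS=","} NR>5 { gsub(/"/,""); print $1 "," $2 "," $3 $4 }'
```

Since $4 already begins with a space after the split, concatenating $3 $4 directly avoids a doubled space.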
Upvotes: 2
Views: 156
Reputation: 2801
You don't have to worry about comma parsing at all: once you've detected your header row, set
FS = "^$"
(a regex that can never match inside a line), so each record stays one single field:
echo "${_input_data_}" | mawk '_+=(!__<NF)*NR { FS="^$" } _<NR' FS=','
experiment,182002, "July 31, 2022 20:00:00"
unit 1998,183065, "July 3, 2022 21:00:00"
asdf, 202065, "May 25, 2022 20:00:00"
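A plainer sketch of the same idea (sample data inlined; the one-liner above detects the header dynamically, while here the 5-line skip is hard-coded):

```shell
# With FS="^$" -- a regex no non-empty line can match -- awk never splits
# the record, so the whole line lands in $1 and the embedded comma needs
# no special handling.
printf '%s\n' \
  'AuditReport' 'asdf' 'qwerty' 'asdf' 'name, id, logindate' \
  'experiment,182002, "July 31, 2022 20:00:00"' |
awk 'BEGIN{FS="^$"} NR>5 { print $1 }'
```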
Upvotes: 0
Reputation: 133518
With your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any POSIX awk version. It uses awk's match function with the regex "[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*" to locate the quoted date, then substr (to get sub-strings) while printing the values, to get the expected results as per OP's request.
awk '
match($0,/"[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*"/){
val=substr($0,RSTART+1,RLENGTH-2)
gsub(/,/,"",val)
print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file
Or, in case you want to keep the " before and after the needed values, a slight change in the above code will do it:
awk '
match($0,/"[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*"/){
val=substr($0,RSTART,RLENGTH)
gsub(/,/,"",val)
print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file
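To see the match()/substr() mechanics in isolation, here is a stripped-down illustration using a simplified pattern /"[^"]*"/ in place of the full date regex above:

```shell
# match() sets RSTART (1-based position of the match) and RLENGTH (its
# length); substr() then carves out the quoted part so the embedded comma
# can be removed before reassembling the line.
echo 'asdf, 202065, "May 25, 2022 20:00:00"' |
awk 'match($0, /"[^"]*"/) {
  val = substr($0, RSTART+1, RLENGTH-2)   # text between the quotes
  gsub(/,/, "", val)                      # drop the inner comma
  print substr($0, 1, RSTART-1) val substr($0, RSTART+RLENGTH)
}'
```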
Here is the Online demo for the regex used in the above code, for understanding purposes.
Upvotes: 3
Reputation: 203522
Whenever you find yourself piping the output of one call to awk into the input of another call to awk, you're almost always going down the wrong path.
Using any awk:
$ awk 'BEGIN{FS=OFS="\""} NR>5{for (i=2; i<=NF; i+=2) gsub(/ *, */," ",$i); print}' sample.csv
experiment,182002, "July 31 2022 20:00:00"
unit 1998,183065, "July 3 2022 21:00:00"
asdf, 202065, "May 25 2022 20:00:00"
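The loop touches only even-numbered fields because with " as the field separator, odd-numbered fields lie outside the quotes and even-numbered fields inside them, so only the even ones can contain a comma that needs rewriting. A single-line illustration:

```shell
# With FS="\"", $1 is the text before the first quote, $2 the text inside
# the quotes, $3 the text after -- so rewriting ", " only in even-numbered
# fields leaves the real delimiters untouched.
echo 'unit 1998,183065, "July 3, 2022 21:00:00"' |
awk 'BEGIN{FS=OFS="\""} { for (i=2; i<=NF; i+=2) gsub(/ *, */, " ", $i); print }'
```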
For a general way to handle more complicated CSVs using any awk, see What's the most robust way to efficiently parse CSV using awk?.
Upvotes: 3