pebble unit

Reputation: 1182

Unable to merge two columns into one column in awk

I have a CSV in the format name,id,logindate, where the logindate appears as "July 15, YYYY HH:mm:ss", i.e. abc,123,"July 15, YYYY HH:mm:ss". Please note that the first 5 lines contain headers and other information that should be skipped. So a sample CSV file may look like:

AuditReport  
asdf  
qwerty  
asdf  
name, id, logindate   
experiment,182002, "July 31, 2022 20:00:00"  
unit 1998,183065, "July 3, 2022 21:00:00"  
asdf, 202065, "May 25, 2022 20:00:00" 

For my output, I would like to get the following (headers are removed):

experiment,182002, "July 31 2022 20:00:00"  
unit 1998,183065, "July 3 2022 21:00:00" 
asdf, 202065, "May 25 2022 20:00:00"

My main task is to parse the commas properly even when one is included inside the quoted string.

After much Google searching and going through several SO questions, I have come to the conclusion that a CSV parser, some other language, or even GNU awk (using FPAT) would be a better tool for this, but I'm told that the production server at the company my dad works for uses awk and should be assumed not to be gawk. (I'm doing random small odd tasks to prepare myself for finding a job.)

I'm trying to work around this by removing the "" and splitting on FS=",", then concatenating the last two columns back together. However, my output keeps giving me 4 columns (I'm unable to concatenate the last two columns into one).

My code is:

/usr/bin/env awk 'BEGIN{FS=","} NR>5 {print}' sample.csv | awk '{ gsub("\"", "") } { $1=$1 } 1' | awk '{ print $1, $2, $3" "$4 }' > test.csv

I also tried the following:
https://stackoverflow.com/a/48386788/16034206
awk '{$2=$2"-"$3;$3=""} 1' Input_file
In my case:
/usr/bin/env awk 'BEGIN{FS=","} NR>5 {print}' sample.csv | awk '{ gsub("\"", "") } { $1=$1 } 1' | awk '{ $3=$3" "$4; $4="" } 1' > test.csv
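For what it's worth, since splitting on "," leaves the date's two halves in the last two fields, the whole pipeline can be collapsed into a single POSIX awk pass; a minimal sketch, assuming the quoted date is always the last field and always contains exactly one comma:

```shell
# Recreate the sample input from the question
cat > sample.csv <<'EOF'
AuditReport
asdf
qwerty
asdf
name, id, logindate
experiment,182002, "July 31, 2022 20:00:00"
unit 1998,183065, "July 3, 2022 21:00:00"
asdf, 202065, "May 25, 2022 20:00:00"
EOF

# Skip the 5 header lines, then glue the date's two halves ($3 and $4)
# back together; $4 already begins with the space that followed the comma,
# so no separator needs to be added and the quotes are left in place.
awk 'BEGIN{FS=OFS=","} NR>5{print $1, $2, $3 $4}' sample.csv
```

This prints the three data rows with the inner comma removed but the surrounding quotes kept, matching the desired output shown above.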

Upvotes: 2

Views: 156

Answers (3)

RARE Kpop Manifesto

Reputation: 2801

you don't have to worry about comma parsing once you've detected your header row by setting

  • FS = "^$"
echo "${_input_data_}" |
mawk '_+=(!__<NF)*NR { FS="^$" } _<NR' FS=','
experiment,182002, "July 31, 2022 20:00:00"  
unit 1998,183065, "July 3, 2022 21:00:00"  
asdf, 202065, "May 25, 2022 20:00:00"
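Spelled out in longhand (a sketch equivalent in spirit, not the answerer's exact code): split on commas only until the first line with more than one field (the header row), remember its line number, then set FS to "^$" (a regex that can never match inside a line, so nothing is split) and print every later line verbatim:

```shell
# Sample input as in the question
cat > sample.csv <<'EOF'
AuditReport
asdf
qwerty
asdf
name, id, logindate
experiment,182002, "July 31, 2022 20:00:00"
unit 1998,183065, "July 3, 2022 21:00:00"
asdf, 202065, "May 25, 2022 20:00:00"
EOF

awk '
  NF > 1 && !hdr { hdr = NR; FS = "^$" }  # first multi-field line = header row
  hdr && NR > hdr                         # print later lines verbatim, unsplit
' FS=',' sample.csv
```

Because FS changes only after the header line has been read, the data rows are never split at all, so their embedded commas need no special handling.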

Upvotes: 0

RavinderSingh13

Reputation: 133518

With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any POSIX awk. It uses awk's match function with the regex "[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*" to find the quoted date and get the expected results as per OP's request, then substr (to get substrings) while printing the values.

awk '
match($0,/"[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*"/){
  val=substr($0,RSTART+1,RLENGTH-2)
  gsub(/,/,"",val)
  print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file

OR, in case you want to keep the " before and after the needed values, then a slight change in the above code will do it:

awk '
match($0,/"[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*"/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/,/,"",val)
  print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file

Here is the Online demo for the regex used in the above code, for understanding purposes.
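For reference, match() sets the built-in variables RSTART (where the match starts) and RLENGTH (its length), which the substr() calls above use to cut the line into before/match/after pieces. A quick illustration with a deliberately simplified regex (not the full pattern from the answer):

```shell
echo 'asdf, 202065, "May 25, 2022 20:00:00"' |
awk 'match($0, /"[^"]*"/) {
  printf "RSTART=%d RLENGTH=%d matched=%s\n", RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}'
```

This prints RSTART=15 RLENGTH=23 matched="May 25, 2022 20:00:00", i.e. the quoted date starts at character 15 of the line and is 23 characters long.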

Upvotes: 3

Ed Morton

Reputation: 203522

Whenever you find yourself piping the output of one call to awk into the input of another call to awk, you're almost always going down the wrong path.

Using any awk:

$ awk 'BEGIN{FS=OFS="\""} NR>5{for (i=2; i<=NF; i+=2) gsub(/ *, */," ",$i); print}' sample.csv
experiment,182002, "July 31 2022 20:00:00"
unit 1998,183065, "July 3 2022 21:00:00"
asdf, 202065, "May 25 2022 20:00:00"
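The trick: with FS set to the double quote, any text that was inside quotes lands in the even-numbered fields, so the gsub only ever touches quoted content. A quick sketch to see the field layout on one sample line:

```shell
printf '%s\n' 'asdf, 202065, "May 25, 2022 20:00:00"' |
awk 'BEGIN{FS="\""} { for (i = 1; i <= NF; i++) printf "field %d: [%s]\n", i, $i }'
```

Field 1 holds the unquoted prefix, field 2 holds exactly the text between the quotes, and field 3 is the (empty) unquoted suffix; rejoining them with OFS="\"" restores the quotes.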

For a general way to handle more complicated CSVs using any awk, see What's the most robust way to efficiently parse CSV using awk?.

Upvotes: 3
