Reputation: 325
I've currently been trying to create an awk script that will take my .csv and convert it into a .tsv. I have managed to achieve this with the below script:
$ awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.csv > file.tsv
Example csv data
"visitor_date","country","browser","browser_version","id1","id2","id3","id4","true or false column"
"05/10/2019 00:55","United States","App","","692467","2163702","4117512086","format","0"
"05/10/2019 00:56","United States","SamsungBrowser","8","692467","2163702","4117512083","format","1"
"05/10/2019 01:08","United States","safari","11","692467","2163704","4117511299","format","1"
"05/10/2019 03:25","United States","SamsungBrowser","8","692467","2163704","4117511299","format",""
"05/10/2019 07:50","United States","chrome","58","692467","2163704","4117511299","format",""
"05/10/2019 09:18","United States","internet explorer","11","692467","2163702","4117512086","format","0"
However I've found that my data file tends to have all of the columns defined as a sting with the "", even when we have things that might be considered an integer.
My initial test was to try and remove the {","} from everything and then define things as strings on certain columns but I get an error when I try to quote it.
$ awk 'BEGIN { FS="",""; OFS="\t" } {$1=$1; print}' file.csv > file.tsv
Error
awk: cmd. line:1: BEGIN { FS="",""; OFS="\t" } {$1=$1; print}
awk: cmd. line:1: ^ syntax error
From the research I've done online all I seem to find is how you can change a string value in a cell to a different integer value. Is there a way that I can transform certain columns (not cells) to be treated as "integers" and "dates" instead of all as strings? Is what I'm trying to do with my initial test even the right direction I should be taking or is there a different way with AWK to decide what my columns are?
Desired tsv Output
"visitor_date" "country browser" "browser_version" "client_transaction_id" "id1" "id2" "id3" "id4" "true or false column"
05/10/2019 00:55 "United States" "App" null 692467 "2163702" 4117512086 "format" 0
05/10/2019 00:56 "United States" "SamsungBrowser" 8 "null" 692467 2163702 4117512083 "format" 1
05/10/2019 01:08 "United States" "safari" 11 null "692467" 2163704 4117511299 "format" 1
05/10/2019 03:25 "United States" "SamsungBrowser" 8 "null" 692467 2163704 4117511299 "format"
05/10/2019 07:50 "United States" "chrome" 58 null "692467" 2163704 4117511299 "format"
05/10/2019 09:18 "United States" "internet explorer" 11 "null" 692467 2163702 4117512086 "format" 0
Note: The "visitor_date" column would need to be a "date" format.
Apologies if this is something super obvious or basic. I've currently only got 2 days experience using an ubuntu interface on windows 10 and AWK so there is a lot I'm trying to take in.
Upvotes: 0
Views: 2687
Reputation: 5728
It's not awk, but it's very easy using Miller (https://github.com/johnkerl/miller). Running
mlr --c2t cat ./input.csv >./output.csv
You will have
visitor_date country browser browser_version id1 id2 id3 id4 true or false column
05/10/2019 00:55 United States App 692467 2163702 4117512086 format 0
05/10/2019 00:56 United States SamsungBrowser 8 692467 2163702 4117512083 format 1
05/10/2019 01:08 United States safari 11 692467 2163704 4117511299 format 1
05/10/2019 03:25 United States SamsungBrowser 8 692467 2163704 4117511299 format
05/10/2019 07:50 United States chrome 58 692467 2163704 4117511299 format
05/10/2019 09:18 United States internet explorer 11 692467 2163702 4117512086 format 0
Upvotes: 0
Reputation: 189936
If there are no double quotes you want to preserve, simply
awk 'BEGIN { FS=","; OFS="\t" } { gsub("\"", "") } { $1=$1 } 1' file.csv >file.tsv
Incidentally, this also demonstrates how to quote a literal double-quote character inside a double-quoted string; backslash-escape it.
If you only want to remove double quotes around some columns, you can limit the scope of the gsub
. For example,
... { gsub("\"", "", $1); gsub("\"", "", $3) } 1
to only manipulate columns 1 and 3. Then you can drop the { $1 = $1 }
because we are forcing Awk to recalculate its columns anyway (so the FS
gets replaced with OFS
).
To not manipulate the first line (which commonly contains the header, not data), add a condition:
... NR>1 { gsub(...) }1
The "replace everywhere" can be expressed quite succinctly in sed
, though it tends to be more of a write-only language.
sed -e 's/,/\t/g' -e 's/"//g' file.csv >file.tsv
Not all sed
dialects interpret \t
as a literal tab, and some are even allergic to multiple -e
options; but this should work fine on Ubuntu and generally Linux (other than possibly Busybox and other similar attempts to relive the constraints of the jolly 1970s).
Upvotes: 2