Reputation: 2636
Say I have the following csv file:
id,message,time
123,"Sorry, This message
has commas and newlines",2016-03-28T20:26:39
456,"It makes the problem non-trivial",2016-03-28T20:26:41
I want to write a bash command that will return only the time column. i.e.
time
2016-03-28T20:26:39
2016-03-28T20:26:41
What is the most straightforward way to do this? You can assume the availability of standard Unix utilities such as awk, gawk, cut, grep, etc.
Note the presence of quotes ("") which escape the commas and newline characters, and which make trivial attempts with
cut -d , -f 3 file.csv
futile.
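To see concretely why field counting fails, here is a short sketch using Python's standard csv module (the data is the sample from the question, inlined as a string rather than read from file.csv):

```python
import csv
import io

data = '''id,message,time
123,"Sorry, This message
has commas and newlines",2016-03-28T20:26:39
456,"It makes the problem non-trivial",2016-03-28T20:26:41
'''

# Naive splitting treats every comma and newline as a separator,
# so the quoted message is broken into pieces.
naive = [line.split(',') for line in data.splitlines()]
print(len(naive))  # 4 physical lines instead of 3 logical records

# A real CSV parser honors the quotes and reassembles the record.
rows = list(csv.reader(io.StringIO(data)))
print([row[-1] for row in rows[1:]])  # ['2016-03-28T20:26:39', '2016-03-28T20:26:41']
```

The quoted message spans two physical lines and contains a comma, so anything that splits on raw commas and newlines sees the wrong number of fields.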
Upvotes: 28
Views: 31144
Reputation: 11
If you don't need to handle escaped quotes inside a field, or newlines within a field, this function should help you parse a line (the result is left in the global array).
parse_csv_line() {
local line="$1"
local in_quotes=0
local field=""
array=()
for (( idx=0; idx<${#line}; idx++ )); do
local char="${line:$idx:1}"
if [ "$char" = "\"" ]; then
in_quotes=$((1 - in_quotes))
field+="$char"
elif [ "$char" = "," ] && [ $in_quotes = 0 ]; then
array+=("$field")
field=""
else
field+="$char"
fi
done
if [ ! -z "$field" ]; then
array+=("$field")
fi
}
Upvotes: 0
Reputation: 382860
csvcut from csvkit
Example
csvkit was mentioned at https://stackoverflow.com/a/36288388/895245, but here are the examples.
Install:
pip install csvkit
Sample CSV input file:
main.csv
a,"b
c",d
e,f
Get the first column:
csvcut -c 1 main.csv
which outputs:
a
e
or to get the second column:
csvcut -c 2 main.csv
which outputs the following valid CSV with a single column:
"b
c"
f
Or to swap the two columns around:
csvcut -c 2,1 main.csv
which outputs another valid CSV file:
"b
c",a
f,e
Tested on Ubuntu 23.04, csvkit==1.1.1.
csvtool
This is another good one. It is a compiled executable rather than a Python script, so it can be much faster on large datasets. Its format operation is useful if you want to print a single column without quoting or escapes, which works as long as there are no multiline entries; csvkit does not seem to support this.
Install:
sudo apt install csvtool
Sample usage:
printf 'a,"b,c",d\ne,"f""g",h\n' | csvtool format '%(2)\n' -
output:
b,c
f"g
See also: How to extract one column of a csv file
Tested on Ubuntu 23.10, csvtool 2.4-3.
Upvotes: 2
Reputation: 563
csvquote is designed for exactly this kind of thing. It sanitizes the file (reversibly) and allows awk to depend on commas being field separators and newlines being record separators.
Upvotes: 0
Reputation: 15293
I think you are overthinking it.
$: echo time; grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}$' file
time
2016-03-28T20:26:39
2016-03-28T20:26:41
If you want to check for that comma just to be sure,
$: echo time; sed -En '/,[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}$/{ s/.*,//; p; }' file
time
2016-03-28T20:26:39
2016-03-28T20:26:41
Upvotes: 0
Reputation: 241868
Perl to the rescue! Use the Text::CSV_XS module to handle CSV.
perl -MText::CSV_XS=csv -we 'csv(in => $ARGV[0],
on_in => sub { $_[1] = [ $_[1][-1] ] })
' -- file.csv
csv is the subroutine that processes the CSV. in specifies the input file; $ARGV[0] contains the first command-line argument, i.e. file.csv. on_in specifies code to run for each row: it gets the current row as its second argument, i.e. $_[1]. We just set the whole row to the contents of the last column.
Upvotes: 0
Reputation: 572
Vanilla bash script
save this code as parse_csv.sh, give it execution privilege (chmod +x parse_csv.sh)
#!/bin/bash
# vim: ts=4 sw=4 hidden nowrap
# @copyright Copyright © 2021 Carlos Barcellos <carlosbar at gmail.com>
# @license https://www.gnu.org/licenses/lgpl-3.0.en.html
if [ "$1" = "-h" -o "$1" = "--help" -o "$1" = "-v" ]; then
    echo "parse csv 0.1"
    echo ""
    echo "parse_csv.sh [csv file] [delimiter]"
    echo "  csv file   csv file to parse; default stdin"
    echo "  delimiter  delimiter to use; default is comma"
    exit 0
fi
delim=,
if [ $# -ge 1 ]; then
    [ -n "$1" ] && file="$1"
    [ -n "$2" -a "$2" != "\"" ] && delim="$2"
fi
processLine() {
    if [[ ! "$1" =~ \" ]]; then
        # no quotes on this line: plain IFS word splitting is enough
        (
            IFS="$delim"
            fields=($1)
            echo "${fields[@]}"
        )
        return 0
    fi
    in_quotes=0
    fields=()
    acc=
    for (( x=0; x < ${#1}; x++ )); do
        if [ "${1:x:1}" = "${delim:0:1}" -o $((x+1)) -ge ${#1} ] && [ $in_quotes -ne 1 ]; then
            [ "${1:x:1}" != "${delim:0:1}" ] && acc="${acc}${1:x:1}"
            fields+=("$acc")
            acc=
        elif [ "${1:x:1}" = "\"" ]; then
            if [ $in_quotes -eq 1 ] && [ "${1:x+1:1}" = "\"" ]; then
                acc="${acc}${1:x:1}"
                x=$((x+1))  # skip the second quote of the escaped pair
            else
                in_quotes=$((!in_quotes))
            fi
            [ $((x+1)) -ge ${#1} ] && fields+=("$acc")
        else
            acc="${acc}${1:x:1}"
        fi
    done
    echo "${fields[@]}"
    return 0
}
while read -r line; do
    processLine "$line"
done < "${file:-/dev/stdin}"
Then run: ./parse_csv.sh "csv file". To print only the last column, change echo "${fields[@]}" to echo "${fields[-1]}"
Upvotes: 0
Reputation: 11355
To handle specifically those newlines that are in double-quoted strings, and leave alone those that are outside them, use GNU awk (needed for RT):
gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file.csv
This works by splitting the file along " characters and removing newlines in every other block. Then use awk to split the columns and display the last one:
gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file.csv \
| awk -F, '{print $NF}'
Output
time
2016-03-28T20:26:39
2016-03-28T20:26:41
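For illustration, the split-on-quote trick can be sketched in Python (an equivalent of the gawk stage, with the question's data inlined): after splitting on the quote character, the chunks at odd indices lie inside quotes, so their newlines can be removed and the text reassembled:

```python
data = '''id,message,time
123,"Sorry, This message
has commas and newlines",2016-03-28T20:26:39
456,"It makes the problem non-trivial",2016-03-28T20:26:41
'''

chunks = data.split('"')
# Odd-indexed chunks are inside quotes (gawk's even NR with RS='"'):
# strip their newlines, then put the quotes back.
flat = '"'.join(c.replace('\n', '') if i % 2 else c
                for i, c in enumerate(chunks))
# One record per line now, so the last comma-separated field is the time.
print([line.split(',')[-1] for line in flat.splitlines()[1:]])
```

This assumes quotes are balanced and ignores escaped "" pairs, the same limitation the gawk one-liner has.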
Upvotes: 7
Reputation: 1517
awk -F, '!/This/{print $NF}' file
time
2016-03-28T20:26:39
2016-03-28T20:26:41
Upvotes: -3
Reputation: 3684
I ran into something similar when attempting to deal with lspci -m output, but the embedded newlines would need to be escaped first (though IFS=, should work here, since it abuses bash's quote evaluation). Here's an example:
f:13.3 "System peripheral" "Intel Corporation" "Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" -r01 "Super Micro Computer Inc" "Device 0838"
And the only reasonable way I can find to bring that into bash is along the lines of:
# echo 'f:13.3 "System peripheral" "Intel Corporation" "Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" -r01 "Super Micro Computer Inc" "Device 0838"' | { eval array=($(cat)); declare -p array; }
declare -a array='([0]="f:13.3" [1]="System peripheral" [2]="Intel Corporation" [3]="Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" [4]="-r01" [5]="Super Micro Computer Inc" [6]="Device 0838")'
#
Not a full answer, but might help!
Upvotes: 0
Reputation: 67497
Another awk alternative, using FS:
$ awk -F'"' '!(NF%2){getline remainder;$0=$0 OFS remainder}
NR>1{sub(/,/,"",$NF); print $NF}' file
2016-03-28T20:26:39
2016-03-28T20:26:41
Upvotes: 0
Reputation: 157992
As chepner said, you are encouraged to use a programming language which is able to parse CSV.
Here is an example in Python 3:
import csv

with open('a.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, quotechar='"')
    for row in reader:
        print(row[-1])  # row[-1] gives the last column
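The same approach can be checked without a file on disk by feeding the question's data through io.StringIO:

```python
import csv
import io

data = ('id,message,time\n'
        '123,"Sorry, This message\nhas commas and newlines",2016-03-28T20:26:39\n'
        '456,"It makes the problem non-trivial",2016-03-28T20:26:41\n')

# csv.reader reassembles the quoted, multiline record into one row.
for row in csv.reader(io.StringIO(data)):
    print(row[-1])
# prints:
# time
# 2016-03-28T20:26:39
# 2016-03-28T20:26:41
```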
Upvotes: 19
Reputation: 328604
CSV is a format which needs a proper parser (i.e. it can't be parsed with regular expressions alone). If you have Python installed, use the csv
module instead of plain Bash.
If not, consider csvkit which has a lot of powerful tools to process CSV files from the command line.
Upvotes: 6