SJU
SJU

Reputation: 187

Extract columns from a CSV file using Linux shell commands

I need to "extract" certain columns from a CSV file. The list of columns to extract is long and their indices do not follow a regular pattern. So far I've come up with a regular expression for a comma-separated value but I find it frustrating that in the RHS side of sed's substitute command I cannot reference more than 9 saved strings. Any ideas around this?

Note that comma-separated values that contain a comma must be quoted so that the comma is not mistaken for a field delimiter. I'd appreciate a solution that can handle such values properly. Also, you can assume that no value contains a new line character.

Upvotes: 2

Views: 2635

Answers (4)

Ed Morton
Ed Morton

Reputation: 203209

With GNU awk:

$ cat file
a,"b,c",d,e

$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $2}' file
"b,c"
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $3}' file
d

$ cat file
a,"b,c",d,e,"f,g,h",i,j

$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, -vcols="1,5,7,2" 'BEGIN{n=split(cols,a,/,/)} {for (i=1;i<=n;i++) printf "%s%s", $(a[i]), (i<n?OFS:ORS)}' file
a,"f,g,h",j,"b,c"

See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for details. I doubt if it'd handle escaped double quotes embedded in a field, e.g. a,"b""c",d or a,"b\"c",d.

See also What's the most robust way to efficiently parse CSV using awk? for how to parse CSVs with awk in general.

Upvotes: 3

elixon
elixon

Reputation: 1292

There is command-line csvtool available - https://colin.maudry.com/csvtool-manual-page/

# apt-get install csvtool

Upvotes: 0

Chris Koknat
Chris Koknat

Reputation: 3451

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
I provided sample code within my answer here: parse csv file using gawk

Upvotes: 0

hek2mgl
hek2mgl

Reputation: 157947

CSV is not that easy to parse like it might look in the first place. This is because there can be a plenty of different delimiters or fixed column widths to separate the data, and also the data may contain the delimiter itself (escaped).

Like I already told here I would use a programming language which supports a CVS library for that.

Use

Upvotes: 1

Related Questions