Reputation: 1391
I have a csv file with over 5k fields/columns with header names. I would like to import only some specific fields to my database.
I am using local infile for other smaller files which need to be imported
LOAD DATA
LOCAL INFILE 'C:/wamp/www/imports/new_export.csv'
INTO TABLE table1
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(colour,shape,size);
Assigning dummy variables for columns to skip might be cumbersome, Also I would prefer to reference using the fields headers to future proof in case the file has additional fields
I am considering using awk on the file before loading the file to the database. But the examples I have found in search don't seem to work.
Any suggestions on best approach for this would be appreciated.
Upvotes: 2
Views: 391
Reputation: 360103
This is similar to MvG's answer, but it doesn't require gawk
4 and thus uses -F
as suggested in that answer. It also shows a technique for listing the desired fields and iterating over the list. This may make the code easier to maintain if there is a large list.
#!/usr/bin/awk -f
BEGIN {
col_list = "colour shape size" # continuing with as many as desired for output
num_cols = split(col_list, cols)
FS = OFS = ","
}
NR==1 {
for (i = 1; i <= NF; i++) {
p[$i] = i # remember column for name
}
# next # enable this line to suppress headers.
}
{
delim = ""
for (i = 1; i <= num_cols; i++) {
printf "%s%s", delim, $p[cols[i]]
delim = OFS
}
printf "\n"
}
Upvotes: 1
Reputation: 60868
Here is a full-featured solution which can deal with all kinds of quotes and commas in the values of the csv table, and can extract columns by name. It requires gawk
and is based on the FPAT
feature suggested in this answer.
BEGIN {
# Allow simple values, quoted values and even doubled quotes
FPAT="\"[^\"]*(\"\"[^\"]*)*\"|[^,]*"
}
NR==1 {
for (i = 1; i <= NF; i++) {
p[$i]=i # remember column for name
}
# next # enable this line to suppress headers.
}
{
print $p["colour"] "," $p["shape"] "," $p["size"]
}
Write this to a file, to be invoked by gawk -f file.awk
.
As the column-splitting and the index-by-header features are kind of orthogonal, you could use part of the script on non-GNU awk to select by column name, not using FPAT
but simple -F,
instead.
Upvotes: 0
Reputation: 60868
Does your actual data have any commas? If not, you might be best served using cut
:
cut -d, -f1,2,5,8-12
will select the named fields, splitting lines at the ,
. If any of your "
-enclosed text fields does contain a ,
, things will break, as cut
doesn't know about "
.
Upvotes: 0