Santosh Pillai
Santosh Pillai

Reputation: 1391

Extract specific fields from text file

I have a csv file with over 5k fields/columns with header names. I would like to import only some specific fields to my database.

I am using local infile for other smaller files which need to be imported

LOAD DATA
LOCAL INFILE 'C:/wamp/www/imports/new_export.csv'
INTO TABLE table1
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(colour,shape,size);

Assigning dummy variables for columns to skip might be cumbersome, Also I would prefer to reference using the fields headers to future proof in case the file has additional fields

I am considering using awk on the file before loading the file to the database. But the examples I have found in search don't seem to work.

Any suggestions on best approach for this would be appreciated.

Upvotes: 2

Views: 391

Answers (3)

Dennis Williamson
Dennis Williamson

Reputation: 360103

This is similar to MvG's answer, but it doesn't require gawk 4 and thus uses -F as suggested in that answer. It also shows a technique for listing the desired fields and iterating over the list. This may make the code easier to maintain if there is a large list.

#!/usr/bin/awk -f
BEGIN {
    col_list = "colour shape size" # continuing with as many as desired for output
    num_cols = split(col_list, cols)
    FS = OFS = ","
}

NR==1 {
    for (i = 1; i <= NF; i++) {
        p[$i] = i # remember column for name
    }
    # next # enable this line to suppress headers.
}

{
    delim = ""
    for (i = 1; i <= num_cols; i++) {
        printf "%s%s", delim, $p[cols[i]]
        delim = OFS
    }
    printf "\n"
}

Upvotes: 1

MvG
MvG

Reputation: 60868

Here is a full-featured solution which can deal with all kinds of quotes and commas in the values of the csv table, and can extract columns by name. It requires gawk and is based on the FPAT feature suggested in this answer.

BEGIN {
  # Allow simple values, quoted values and even doubled quotes
  FPAT="\"[^\"]*(\"\"[^\"]*)*\"|[^,]*"
}
NR==1 {
  for (i = 1; i <= NF; i++) {
    p[$i]=i # remember column for name
  }
  # next # enable this line to suppress headers.
}
{
  print $p["colour"] "," $p["shape"] "," $p["size"]
}

Write this to a file, to be invoked by gawk -f file.awk.

As the column-splitting and the index-by-header features are kind of orthogonal, you could use part of the script on non-GNU awk to select by column name, not using FPAT but simple -F, instead.

Upvotes: 0

MvG
MvG

Reputation: 60868

Does your actual data have any commas? If not, you might be best served using cut:

cut -d, -f1,2,5,8-12

will select the named fields, splitting lines at the ,. If any of your "-enclosed text fields does contain a ,, things will break, as cut doesn't know about ".

Upvotes: 0

Related Questions