virnovus
virnovus

Reputation: 139

Cleaning a CSV file using linux commands, while ignoring commas between quotes

I've got a rather large CSV file that I need to occasionally import into a PostgreSQL database, but it virtually always contains too many errors to import using the SQL COPY command. I've managed to fix the issue with quotation marks being in places where they shouldn't, by using this awk command:

awk -F\" 'NF % 2 == 1 { print $0 }' ./db_downloaded.csv > ./db_sanitized.csv

This drops any line that has an odd number of quotation marks on it, and works quite well. The other common problem I run into is with an incorrect number of columns. My current solution to this problem is to use the following awk command to drop any line that has the wrong number of commas:

awk -F"," 'NF == 40 { print $0 }' ./db_sanitized.csv > ./db_sanitized2.csv

However, this makes it drop any rows that have commas between quotes, which is actually quite a few rows. Now, it's not that critical that every single row be imported, but I'd really like it if I had a way to count only commas that didn't appear between quotation marks on a row. The only way I can think to do this is to have some sort of boolean variable that flips every time there's a quotation mark in the stream, which would turn off checking for commas, but I'm not sure where I would start if I was going to do that.

I'm mostly interested in using linux commands like grep, awk, tr, etc. that I can run on a remote server, rather than writing my own C++ program, but I may have to go that route if there's no other way.

Edit: I ended up using this ruby script:

lines = File.new("/home/db/product_list.csv", "r")
sanitized = File.new("/home/db/product_list_sanitized.csv", "w")

lines.each do |l|
  if l.count("\"") % 2 == 0
    if l.count(",") > 39
      u = true
      commas = 0
      l.each_char do |c|
        if (c == "\"")
          u = (not u)
        elsif (u && (c == ","))
          commas += 1
        end
      end
      sanitized.write(l) if commas == 39
    else
      sanitized.write(l)
    end
  end
end

sanitized.close

Upvotes: 2

Views: 1567

Answers (1)

Kevin
Kevin

Reputation: 56049

Stash the original line, replace all sets of quotes with a more well-behaved token, and check against that; print the original if it matches:

awk -F, '{line=$0;gsub(/"[^"]*"/,"x")}NF%2{print line}' test.in

Upvotes: 2

Related Questions