Reputation: 139
I've got a rather large CSV file that I need to occasionally import into a PostgreSQL database, but it virtually always contains too many errors to import using the SQL COPY command. I've managed to fix the issue with quotation marks being in places where they shouldn't, by using this awk command:
awk -F\" 'NF % 2 == 1 { print $0 }' ./db_downloaded.csv > ./db_sanitized.csv
This drops any line that has an odd number of quotation marks on it, and works quite well. The other common problem I run into is with an incorrect number of columns. My current solution to this problem is to use the following awk command to drop any line that has the wrong number of commas:
awk -F"," 'NF == 40 { print $0 }' ./db_sanitized.csv > ./db_sanitized2.csv
However, this makes it drop any rows that have commas between quotes, which is actually quite a few rows. Now, it's not that critical that every single row be imported, but I'd really like it if I had a way to count only commas that didn't appear between quotation marks on a row. The only way I can think to do this is to have some sort of boolean variable that flips every time there's a quotation mark in the stream, which would turn off checking for commas, but I'm not sure where I would start if I was going to do that.
I'm mostly interested in using linux commands like grep, awk, tr, etc. that I can run on a remote server, rather than writing my own C++ program, but I may have to go that route if there's no other way.
Edit: I ended up using this ruby script:
lines = File.new("/home/db/product_list.csv", "r")
sanitized = File.new("/home/db/product_list_sanitized.csv", "w")
lines.each do |l|
if l.count("\"") % 2 == 0
if l.count(",") > 39
u = true
commas = 0
l.each_char do |c|
if (c == "\"")
u = (not u)
elsif (u && (c == ","))
commas += 1
end
end
sanitized.write(l) if commas == 39
else
sanitized.write(l)
end
end
end
sanitized.close
Upvotes: 2
Views: 1567
Reputation: 56049
Stash the original line, replace all sets of quotes with a more well-behaved token, and check against that; print the original if it matches:
awk -F, '{line=$0;gsub(/"[^"]*"/,"x")}NF%2{print line}' test.in
Upvotes: 2