Reputation: 75
I have a large .csv file with 9 million rows that I would like to import into the database. Some of its columns contain text with double quotes or other special characters. For example, I would like to import this row:
ID  BH  Units   Name                      Type_building  Year_cons
1   4   900.00  schoolgebouw "De Bolster  Schoolgebouw   2014-01-01
As you can see, there is a double quote in the fourth column (Name). None of the values in the .csv file are quoted, but sometimes a double quote or a backslash '\' appears in the text. When I try to load the data using:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' NULL '\N';
It gives an error message: ERROR: value too long for type character varying(25). Apparently it sees the double quote as the start of a quoted string and tries to combine everything after it in the .csv file (including the fifth and sixth columns) into a single cell (so that cell would contain 'De Bolster Schoolgebouw 2014-01-01'), which doesn't fit because the 'Name' column allows at most 25 characters.
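For reference, with the ';' delimiter the raw line in the file looks like this:
1;4;900.00;schoolgebouw "De Bolster;Schoolgebouw;2014-01-01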
I found a similar topic (Is it possible to turn off quote processing in the Postgres COPY command with CSV format?) in which this solution was presented:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' QUOTE E'\b' NULL '\N';
I think what it does is set the quote character (the default is a double quote) to something else, in this case a backspace, so it won't recognize a double quote as a quote anymore. However, when I run this I get another error: ERROR: invalid input syntax for integer.
What has happened is that every value is now quoted, so the ID value '1' becomes '"1"', and because ID is defined as an integer it won't accept the quotes.
Do you have any idea how to import double quotes and other special characters from a .csv file into a Postgres database?
Thanks in advance!!
Upvotes: 2
Views: 1755
Reputation: 78513
Based on the error message, I doubt this has anything to do with double quoting or anything of the sort -- had it been so, it would have been a widely reported bug and fixed ages ago.
When it comes to Postgres, the error messages are almost always correct and helpful. As such, consider the very real possibility that there are more characters in there than meet the eye.
My own guess is that you have some trailing (or leading) spaces in there somewhere, and as such have pieces of data that look like they fit within 25 characters when viewed in a spreadsheet while being, in fact, longer.
If you don't, my second guess would be some kind of bizarro character set conflict or effect. Perhaps you have some double-byte characters, or two characters that render as a single one due to a diacritic. These look fine in the viewer you're using for your data, but when they get interpreted or viewed as UTF-8 they end up counting as two distinct characters. Unlikely imo, but possible (example).
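If you want to check for either of these once the data is loaded (e.g. after the constraint-free import suggested below), queries along these lines should flag the suspects; <tablename> and name are placeholders matching the question:
select name from <tablename> where name <> btrim(name);                 -- leading/trailing spaces
select name, length(name), octet_length(name) from <tablename> where octet_length(name) <> length(name);   -- multi-byte characters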
Lastly, and per Frank's suggestion, try removing the length constraint. It is only slowing you down as things stand: it slows down inserts and it is preventing you from moving forward. Once the import is done, you'll be able to find the offending rows using the likes of:
select name from <tablename> where length(name) > 25;
... and upon fixing them, you'll be able to re-add your constraint to the table's definition, if it serves any purpose. (Hint: it doesn't, or at the very least shouldn't. There's a real person out there whose name is: "Kim-Jong Sexy Glorious Beast Divine Dick Father Lovely Iron Man Even Unique Poh Un Winn Charlie Ghora Khaos Mehan Hansa Kimmy Humbero Uno Master Over Dance Shake Bouti Bepop Rocksteady Shredder Kung Ulf Road House Gilgamesh Flap Guy Theo Arse Hole Im Yoda Funky Boy Slam Duck Chuck Jorma Jukka Pekka Ryan Super Air Ooy Rusell Salvador Alfons Molgan Akta Papa Long Nameh Ek.")
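For completeness, a minimal sketch of that widen-then-restore sequence, assuming the column is called name and reusing the <tablename> placeholder from the question:
alter table <tablename> alter column name type text;          -- drop the length limit so the import can complete
-- run the \COPY, then find and fix the overlong rows with the query above, then:
alter table <tablename> alter column name type varchar(25);   -- restore the original type if you still want it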
Upvotes: 1