Reputation: 1005
I have a CSV file that looks like this:
W123456,{A123,A234,A345}
W2345567,{A789,A678,A543}
I have Python code that tries to load this CSV file:
import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute('CREATE TABLE mytable(field1 VARCHAR, field2 VARCHAR[])')
con.execute("COPY mytable FROM 'my.csv'")
I get this error:
Traceback (most recent call last):
File "myfile.py", line 5, in <module>
con.execute("COPY mytable FROM 'my.csv'")
RuntimeError: Invalid Input Error: Error on line 1: expected 2 values per row, but got more. (DELIMITER=',' (default), QUOTE='"' (default), ESCAPE='"' (default), HEADER=0' (default), SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0)
I've tried quoting the strings in the list but no luck.
What am I missing?
Upvotes: 1
Views: 2905
Reputation: 1
Using a semicolon (;) instead of a comma (,) as the delimiter worked for me.
It looks like when DuckDB sees a comma, it interprets the following element as data for the next column, so any separator other than the comma should work.
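To illustrate the delimiter effect with Python's standard csv module (assuming DuckDB splits the row the same way a plain RFC 4180 parser would), a minimal sketch:

```python
import csv
import io

# The original row, split on the default comma delimiter:
row = next(csv.reader(io.StringIO("W123456,{A123,A234,A345}")))
print(len(row))  # 4 fields: the braces do not protect the inner commas

# The same data with a semicolon separating the two fields:
row = next(csv.reader(io.StringIO("W123456;{A123,A234,A345}"), delimiter=";"))
print(row)  # 2 fields: the ID and the brace-wrapped list
```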
Upvotes: 0
Reputation: 18106
Let's consider DuckDB 0.4.0 (duckdb pip package 0.4.0) as the current version.
Since the second field value contains commas, it is necessary to escape them by enclosing the value in double quotes.
A slightly reformatted excerpt from RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files.
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx
Therefore, here are the corrected records:
W123456,"{A123,A234,A345}"
W2345567,"{A789,A678,A543}"
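To check that the quoting fixes the field splitting, here is a small sketch using Python's standard csv module, which follows the same RFC 4180 quoting rules:

```python
import csv
import io

# The corrected records, with the brace-wrapped values enclosed in double quotes:
data = '''W123456,"{A123,A234,A345}"
W2345567,"{A789,A678,A543}"'''

rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    # Each row now has exactly two fields; the quotes protect the inner commas.
    print(row)
```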
Please note that you are using the list data type (DuckDB - List):
field2 VARCHAR[]
I could not find documentation on how list values are represented in the CSV format. To find out, let's save the table records to a CSV file and then load them back, performing both operations with the COPY statement.
Let's start from an "empty" database: please remove (or move) the mydb.duckdb file.
Python script:
import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute('CREATE TABLE mytable(field1 VARCHAR, field2 VARCHAR[])')
con.execute("INSERT INTO mytable VALUES ('1', LIST_VALUE('1', '2', '3'))")
con.execute("COPY mytable TO 'output.csv' (HEADER, DELIMITER ',')")
Output file (output.csv):
field1,field2
1,"[1, 2, 3]"
Python script:
import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute("COPY mytable FROM 'output.csv' (HEADER, DELIMITER ',')")
Output:
>>> con.execute("COPY mytable FROM 'output.csv' (HEADER, DELIMITER ',')")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Invalid Input Error: Unimplemented type for cast (VARCHAR -> VARCHAR[]) between line 1 and 2 in column 1. Parser options: DELIMITER=',', QUOTE='"' (default), ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0
It looks like the feature (loading list data type values from a CSV file to a table) is not implemented.
It seems that there is a related GitHub issue: List Type: RuntimeError: Conversion Error: Unimplemented type for cast (VARCHAR -> LIST) · Issue #2698 · duckdb/duckdb.
It may be worth reporting a separate GitHub issue.
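As a workaround until this is implemented, the brace-wrapped values can be parsed in Python and inserted row by row. A minimal sketch (parse_braced_list is a hypothetical helper, and the commented-out INSERT assumes DuckDB's LIST_VALUE as used above):

```python
import csv
import io

def parse_braced_list(value):
    # Convert a value like "{A123,A234,A345}" into a Python list of strings.
    inner = value.strip().lstrip("{").rstrip("}")
    return inner.split(",") if inner else []

# The corrected (quoted) records from above:
data = '''W123456,"{A123,A234,A345}"
W2345567,"{A789,A678,A543}"'''

rows = [(f1, parse_braced_list(f2)) for f1, f2 in csv.reader(io.StringIO(data))]
print(rows[0])  # ('W123456', ['A123', 'A234', 'A345'])

# The parsed rows could then be inserted via LIST_VALUE, e.g. (untested sketch):
# for f1, f2 in rows:
#     placeholders = ", ".join(["?"] * len(f2))
#     con.execute(f"INSERT INTO mytable VALUES (?, LIST_VALUE({placeholders}))",
#                 [f1] + f2)
```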
Upvotes: 1