How to bulk load list values into DuckDB

Question

I have a CSV file that looks like this:

W123456,{A123,A234,A345}
W2345567,{A789,A678,A543}

I have python code that tries to load this csv file:

import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute('CREATE TABLE mytable(field1 VARCHAR, field2 VARCHAR[])')
con.execute("COPY mytable FROM 'my.csv'")

I get this error:

Traceback (most recent call last):

File "myfile.py", line 5, in 

con.execute("COPY mytable FROM 'my.csv'")

RuntimeError: Invalid Input Error: Error on line 1: expected 2 values per row, but got more. (DELIMITER=',' (default), QUOTE='"' (default), ESCAPE='"' (default), HEADER=0' (default), SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0)

I've tried quoting the strings in the list but no luck.

What am I missing?

Sergey Vyacheslavovich Brunov · Accepted Answer

Introduction

Let's consider DuckDB 0.4.0 (duckdb pip package 0.4.0) as the current version.

CSV field value escaping

Since the second field value contains comma, it is necessary to escape it by enclosing in double quotes.

A slightly reformatted excerpt from RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files.

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx

Therefore, here are the corrected records:

W123456,"{A123,A234,A345}"
W2345567,"{A789,A678,A543}"

DuckDB: List data type and CSV format

Please, note that you are using the list data type: DuckDB - List:

field2 VARCHAR[]

Could not find the documentation on how the list data type values are represented in the CSV format.
To find it out, it was decided to save the table records to a CSV file and then to load it back, performing both operations by using the COPY statement.

Step #1. Save table records in CSV file

Let's start from the «empty» database: please, remove (or move) the mydb.duckdb file.

Python script:

import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute('CREATE TABLE mytable(field1 VARCHAR, field2 VARCHAR[])')
con.execute("INSERT INTO mytable VALUES ('1', LIST_VALUE('1', '2', '3'))")
con.execute("COPY mytable TO 'output.csv' (HEADER, DELIMITER ',')")

Output file (output.csv):

field1,field2
1,"[1, 2, 3]"

Step #2. Load CSV file records to table

Python script:

import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute("COPY mytable FROM 'output.csv' (HEADER, DELIMITER ',')")

Output:

>>> con.execute("COPY mytable FROM 'output.csv' (HEADER, DELIMITER ',')")
Traceback (most recent call last):
  File "", line 1, in 
RuntimeError: Invalid Input Error: Unimplemented type for cast (VARCHAR -> VARCHAR[]) between line 1 and 2 in column 1. Parser options: DELIMITER=',', QUOTE='"' (default), ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0

It looks like the feature (loading list data type values from a CSV file to a table) is not implemented.

It seems that there is a related GitHub issue: List Type: RuntimeError: Conversion Error: Unimplemented type for cast (VARCHAR -> LIST) · Issue #2698 · duckdb/duckdb.

Maybe, it is worth reporting a separate GitHub issue.