Sol
Sol

Reputation: 1005

How to bulk load list values into DuckDB

I have a CSV file that looks like this:

W123456,{A123,A234,A345}
W2345567,{A789,A678,A543}

I have python code that tries to load this csv file:

import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute('CREATE TABLE mytable(field1 VARCHAR, field2 VARCHAR[])')
con.execute("COPY mytable FROM 'my.csv'")

I get this error:

Traceback (most recent call last):

File "myfile.py", line 5, in <module>

con.execute("COPY mytable FROM 'my.csv'")

RuntimeError: Invalid Input Error: Error on line 1: expected 2 values per row, but got more. (DELIMITER=',' (default), QUOTE='"' (default), ESCAPE='"' (default), HEADER=0' (default), SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0)

I've tried quoting the strings in the list but no luck.

What am I missing?

Upvotes: 1

Views: 2905

Answers (2)

John Watson
John Watson

Reputation: 1

Using ; instead of , worked for me.

Looks like when duckdb sees , it tries to interpret the following element as data for the next column.

So using a separator other than , should work.

Upvotes: 0

Introduction

Let's consider DuckDB 0.4.0 (duckdb pip package 0.4.0) as the current version.

CSV field value escaping

Since the second field value contains comma, it is necessary to escape it by enclosing in double quotes.

A slightly reformatted excerpt from RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files.

  1. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx
    

Therefore, here are the corrected records:

W123456,"{A123,A234,A345}"
W2345567,"{A789,A678,A543}"

DuckDB: List data type and CSV format

Please, note that you are using the list data type: DuckDB - List:

field2 VARCHAR[]

Could not find the documentation on how the list data type values are represented in the CSV format.
To find it out, it was decided to save the table records to a CSV file and then to load it back, performing both operations by using the COPY statement.

Step #1. Save table records in CSV file

Let's start from the «empty» database: please, remove (or move) the mydb.duckdb file.

Python script:

import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute('CREATE TABLE mytable(field1 VARCHAR, field2 VARCHAR[])')
con.execute("INSERT INTO mytable VALUES ('1', LIST_VALUE('1', '2', '3'))")
con.execute("COPY mytable TO 'output.csv' (HEADER, DELIMITER ',')")

Output file (output.csv):

field1,field2
1,"[1, 2, 3]"

Step #2. Load CSV file records to table

Python script:

import duckdb
con = duckdb.connect(database='mydb.duckdb', read_only=False)
con.execute("COPY mytable FROM 'output.csv' (HEADER, DELIMITER ',')")

Output:

>>> con.execute("COPY mytable FROM 'output.csv' (HEADER, DELIMITER ',')")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Invalid Input Error: Unimplemented type for cast (VARCHAR -> VARCHAR[]) between line 1 and 2 in column 1. Parser options: DELIMITER=',', QUOTE='"' (default), ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0 

It looks like the feature (loading list data type values from a CSV file to a table) is not implemented.

It seems that there is a related GitHub issue: List Type: RuntimeError: Conversion Error: Unimplemented type for cast (VARCHAR -> LIST) · Issue #2698 · duckdb/duckdb.

Maybe, it is worth reporting a separate GitHub issue.

Upvotes: 1

Related Questions