Reading a csv in Julia; CSV.TooManyColumnsError

Question

In pandas, when reading a CSV file with pandas.read_csv, we can set the keyword error_bad_lines=False, which skips lines with too many fields and guarantees that a DataFrame object is returned (see the pandas documentation).

In Julia I am using CSV.read to read some data, but no object is returned. Following the documentation, I used CSV.validate to see what the problem is, and I get CSV.TooManyColumnsError. Is there a similar keyword (to that of pandas) in Julia? More generally, how can I overcome this error and get a DataFrame returned?

Bogumił Kamiński · Accepted Answer

Actually, by default CSV.jl should read in the data and drop the extra columns. Here is an example:

julia> using CSV, DataFrames

julia> println(read("x.txt", String))
a,b,c
1,2,3
4,5,6,7,8
1,2
1,2,3


julia> df = CSV.read("x.txt")
4×3 DataFrame
│ Row │ a      │ b      │ c       │
│     │ Int64⍰ │ Int64⍰ │ Int64⍰  │
├─────┼────────┼────────┼─────────┤
│ 1   │ 1      │ 2      │ 3       │
│ 2   │ 4      │ 5      │ 6       │
│ 3   │ 1      │ 2      │ missing │
│ 4   │ 1      │ 2      │ 3       │

So in short: over-long lines are not skipped but truncated, and over-short lines (as you can see in the example) are filled with missing. In all cases you should get a DataFrame object returned.

Of course CSV.validate should error on the first invalid line:

julia> CSV.validate("x.txt")
ERROR: CSV.TooManyColumnsError("row=2, col=3: expected 3 columns then a newline or EOF; parsed row: '4, 5, 6'")
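
If you do want the pandas behaviour of skipping malformed lines rather than truncating or padding them, one option is to filter the file before parsing. The sketch below uses only Base Julia; skip_bad_lines is a hypothetical helper (not part of CSV.jl), and it assumes a simple comma-delimited file without quoted fields that contain the delimiter or line breaks.

```julia
# Hypothetical helper, not part of CSV.jl: keep only the lines whose
# field count matches the header line. Assumes no quoted fields that
# contain the delimiter or embedded newlines.
function skip_bad_lines(path::AbstractString; delim::AbstractString = ",")
    lines = readlines(path)
    isempty(lines) && return String[]
    # The header defines the expected number of fields.
    nfields = length(split(first(lines), delim))
    # Drop every line (including blank ones) with a different field count.
    return [l for l in lines if length(split(l, delim)) == nfields]
end

# Recreate the x.txt contents from the example in a temporary file.
path, io = mktemp()
write(io, "a,b,c\n1,2,3\n4,5,6,7,8\n1,2\n1,2,3\n")
close(io)

clean = skip_bad_lines(path)
println(join(clean, "\n"))
```

The surviving lines can then be handed to CSV.jl, for example by joining them with newlines and wrapping the result in an IOBuffer; the exact read call depends on your CSV.jl version. Note that, unlike the default CSV.jl behaviour shown above, this drops both the over-long and the over-short rows.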
