Reputation: 553
Often when reading messy CSV files I end up seeing different kinds of errors due to inconsistent data types within a column, for instance:
ComputeError: Could not parse `22.4` as dtype Int64 at column 59.
The current offset in the file is 433793 bytes.
When the file/data is not yet familiar, I likely do not know the name of the column at the 59th position. I'm asking for advice on a more efficient process than what I'm currently doing to overcome these kinds of issues:
1 - First I read the file with the reader option set to 'infer_schema_length=0' (which reads all the data in pl.Utf8 string format). Another option is to use 'ignore_errors=True', but to my understanding it converts the error values to nulls, which is often not what I want.
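For illustration, a sketch of the two options (using a hypothetical file_name.csv):

import polars as pl

# Option 1: read every column as pl.Utf8 so nothing fails to parse
df_str = pl.read_csv("file_name.csv", infer_schema_length=0)

# Option 2: keep schema inference but turn unparseable values into nulls;
# the offending values are silently lost, which is often undesirable
df_null = pl.read_csv("file_name.csv", ignore_errors=True)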
2 - As I don't yet know which column is the 59th, I do a for loop to figure it out:

for i in enumerate(df.columns):
    print(i)
3 - Once I've figured out the column name raising the error, I filter the dataframe for that specific value to identify which row(s) it appears on:
(pl
    .read_csv(file="file_name.csv", infer_schema_length=0)
    .with_row_count()
    .select(
        [
            pl.col("row_nr"),
            pl.col("Error Column Name"),
        ]
    )
    .filter(pl.col("Error Column Name") == "22.4")
)
Output:
shape: (1, 2)
┌────────┬───────────────────┐
│ row_nr ┆ Error Column Name │
│ --- ┆ --- │
│ u32 ┆ str │
╞════════╪═══════════════════╡
│ 842 ┆ 22.4 │
└────────┴───────────────────┘
4 - Then, depending on the file and the case, I adjust the value to what it should be ("224" or "22" or "23"), either in the source file or by modifying the DataFrame, and then convert all the other columns to the desired data types.
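A sketch of the DataFrame route (assuming the correct value is "224" and the intended dtype is Int64):

import polars as pl

df = pl.read_csv("file_name.csv", infer_schema_length=0)

# Replace the bad value, then cast the column to its intended dtype
df = df.with_columns(
    pl.when(pl.col("Error Column Name") == "22.4")
    .then(pl.lit("224"))
    .otherwise(pl.col("Error Column Name"))
    .cast(pl.Int64)
    .alias("Error Column Name")
)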
Questions:
1 - Is there a more efficient way to find out the name of the column at a given position?
2 - Is there a way to read the file so that the error doesn't occur in the first place?
3 - Is there a convenient way to cast the Utf8 columns back to the proper data types afterwards?
Upvotes: 4
Views: 9403
Reputation: 18691
From the docs, the default infer_schema_length is 100. That means if the first 100 values in a column are safely integers, it's going to assign that column an integer data type. When it then gets to a float, it'll error out rather than go back and reset the data type. Ironically, you changed that default in the wrong direction for the result you wanted.
So, instead of doing pl.read_csv(file="file_name.csv", infer_schema_length=0) and then manually casting values, just do pl.read_csv(file="file_name.csv", infer_schema_length=int(1e10)). If you still get the error then make 1e10 bigger, or if it's too slow then make it smaller.
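A minimal repro of that behavior (a sketch with a made-up in-memory CSV, not your file):

import io
import polars as pl

# 100 integer rows followed by one float row: the default
# infer_schema_length=100 infers Int64 and then fails on "22.4"
csv_data = "col\n" + "\n".join(["1"] * 100) + "\n22.4\n"

# Inferring over more rows picks Float64 instead, so the read succeeds
df = pl.read_csv(io.StringIO(csv_data), infer_schema_length=200)
print(df.dtypes)  # [Float64]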
That answers your Q2.
Q1: You don't need to loop through all the values; you can just do df.columns[59] to get column 59. Also, instead of reading in the whole df just to get column names, you can do pl.read_csv("file_name.csv", n_rows=1), which will take up a much smaller memory footprint.
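Combined, that's a one-liner (a sketch):

import polars as pl

# Read just the header plus one row, then look up the offending position
print(pl.read_csv("file_name.csv", n_rows=1).columns[59])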
Q3: Yes and no. It depends on what you consider convenient and what domain of file formats you're opening. If you just want a hierarchy of try int, try float, otherwise keep utf8, then you could do:
for i in df.columns:
    # Try the strictest dtype first; a failed cast raises, so we fall through
    try:
        df = df.with_column(pl.col(i).cast(pl.Int64()))
        continue
    except Exception:
        pass
    # Fall back to float; if that also fails, the column stays Utf8
    try:
        df = df.with_column(pl.col(i).cast(pl.Float64()))
        continue
    except Exception:
        pass
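Note that the strict cast is what makes the try/except work: by default, cast raises a ComputeError when any value in the column fails to parse, so a column that fits neither Int64 nor Float64 simply stays Utf8.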
Upvotes: 2