miroslaavi

Reputation: 553

Polars Reading Files Causing Errors

Often when reading messy csv files I end up seeing different kinds of errors caused by inconsistent data types within a column, for instance:

ComputeError: Could not parse `22.4` as dtype Int64 at column 59.
The current offset in the file is 433793 bytes.

When the file/data is not yet familiar, I likely don't know the name of the column at the 59th position. I'm asking for advice on a more efficient process than the one I'm currently using to overcome these kinds of issues:

1 - First I read the file with the reader option 'infer_schema_length=0', which reads all of the data as pl.Utf8 strings. Another option is to use 'ignore_errors=True', but to my understanding it converts the error values to nulls, which is often not what I want.
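For example, the kind of read calls I mean (the file name here is just a placeholder):

import polars as pl

# Option 1: read every column as pl.Utf8 so nothing can fail to parse
df_str = pl.read_csv("file_name.csv", infer_schema_length=0)

# Option 2: keep type inference but turn unparsable values into nulls
# (this silently loses the original value, which is often not what I want)
df_null = pl.read_csv("file_name.csv", ignore_errors=True)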

2 - As I don't yet know which one the 59th column is, I do a for loop to figure it out:

for i in enumerate(df.columns):
    print(i)

3 - Once I have figured out the name of the column raising the error, I filter the dataframe for that specific value to identify which row(s) it appears on:

(pl
    .read_csv(file="file_name.csv", infer_schema_length=0)
    .with_row_count()
    .select(
        [
            pl.col("row_nr"),
            pl.col("Error Column Name")
        ])
    .filter(pl.col("Error Column Name") == "22.4")
)

Output:

shape: (1, 2)
┌────────┬───────────────────┐
│ row_nr ┆ Error Column Name │
│ ---    ┆ ---               │
│ u32    ┆ str               │
╞════════╪═══════════════════╡
│ 842    ┆ 22.4              │
└────────┴───────────────────┘

4 - Then, depending on the file and the case, I would adjust the value to what it should be ("224", "22", or "23"), either in the source file or by modifying the DF, and then convert all the other columns to the desired data types.
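For reference, this is roughly how I would do the fix in the DF itself (the column name and replacement value are just the ones from the example above):

import polars as pl

df = pl.read_csv("file_name.csv", infer_schema_length=0)

# replace the offending string value first...
df = df.with_columns(
    [
        pl.when(pl.col("Error Column Name") == "22.4")
        .then(pl.lit("22"))
        .otherwise(pl.col("Error Column Name"))
        .alias("Error Column Name")
    ]
)

# ...then cast the column to the dtype it should have had
df = df.with_columns([pl.col("Error Column Name").cast(pl.Int64)])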

Questions:

  1. Is there an easier way to access the nth column in Polars than what I do in step 2?
  2. Is there a better way of handling the values that cause these errors?
  3. If I read the file and all columns as pl.Utf8 and fix the value causing the error, is there a convenient way to automatically detect the best data types for the df's columns after the data has been read, rather than going through them manually column by column?

Upvotes: 4

Views: 9403

Answers (1)

Dean MacGregor

Reputation: 18691

From the docs, the default infer_schema_length is 100. That means if you have a column whose first 100 values are safely integers, it's going to assign that column the integer data type. When it then hits a float it'll error out rather than go back and reset the data type. Ironically, you changed that default in the wrong direction for the result you wanted.

So, instead of doing pl.read_csv(file="file_name.csv", infer_schema_length=0) and then manually casting values, just do pl.read_csv(file="file_name.csv", infer_schema_length=int(1e10)). If you still get the error, make 1e10 bigger; if it's too slow, make it smaller.
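For instance (1e10 is just an arbitrarily large number, tune it to your file):

import polars as pl

# look at (effectively) the whole file before settling on column dtypes
df = pl.read_csv("file_name.csv", infer_schema_length=int(1e10))

# then check what was inferred for each column
print(df.schema)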

That answers your Q2.

Q1: You don't need to loop through all the values; you can just do df.columns[59] to get column 59. Also, instead of reading in the whole df just to get the column names, you can do pl.read_csv("file_name.csv", n_rows=1), which will take up a much smaller memory footprint.
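Roughly like this:

import polars as pl

# only read the first row: enough to get the header without loading all the data
header_df = pl.read_csv("file_name.csv", n_rows=1)

# get the name of the column at position 59
print(header_df.columns[59])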

Q3: Yes/no. It depends on what you consider convenient and what range of file formats you're opening. If you just want a hierarchy of try int, then try float, otherwise keep utf8, then you could do:

for i in df.columns:
    # try the strictest type first; a failed cast raises, so we fall through
    try:
        df = df.with_column(pl.col(i).cast(pl.Int64()))
        continue
    except Exception:
        pass
    # fall back to float; if that also fails, the column stays as utf8
    try:
        df = df.with_column(pl.col(i).cast(pl.Float64()))
        continue
    except Exception:
        pass

Upvotes: 2
