miroslaavi

Reputation: 553

Polars Reading Files Causing Errors

Often when reading messy csv files I end up seeing different kinds of errors caused by inconsistent data types within a column, for instance:

ComputeError: Could not parse `22.4` as dtype Int64 at column 59.
The current offset in the file is 433793 bytes.

When the file/data is not yet familiar, I likely don't know the name of the column at the 59th position. I'm asking for advice on a more efficient process than the one I'm currently using to overcome these kinds of issues:

1 - First I read the file with the reader option 'infer_schema_length=0', which reads all of the data as pl.Utf8 strings. Another option is to use 'ignore_errors=True', but to my understanding it converts the error values to nulls, which is often not what I want.
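For example, the kind of read calls I mean (the file name here is just a placeholder):

import polars as pl

# Option 1: read every column as pl.Utf8 so nothing can fail to parse
df_str = pl.read_csv("file_name.csv", infer_schema_length=0)

# Option 2: keep type inference but turn unparsable values into nulls
# (this silently loses the original value, which is often not what I want)
df_null = pl.read_csv("file_name.csv", ignore_errors=True)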

2 - As I don't yet know which one the 59th column is, I do a for loop to figure it out:

for i in enumerate(df.columns):
    print(i)

3 - Once I have figured out the name of the column raising the error, I filter the dataframe for that specific value to identify which row(s) it appears on:

(pl
    .read_csv(file="file_name.csv", infer_schema_length=0)
    .with_row_count()
    .select(
        [
            pl.col("row_nr"),
            pl.col("Error Column Name")
        ])
    .filter(pl.col("Error Column Name") == "22.4")
)

Output:

shape: (1, 2)
┌────────┬───────────────────┐
│ row_nr ┆ Error Column Name │
│ ---    ┆ ---               │
│ u32    ┆ str               │
╞════════╪═══════════════════╡
│ 842    ┆ 22.4              │
└────────┴───────────────────┘

4 - Then, depending on the file and the case, I would adjust the value to what it should be ("224", "22", or "23"), either in the source file or by modifying the DF, and then convert all the other columns to the desired data types.
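For reference, this is roughly how I would do the fix in the DF itself (the column name and replacement value are just the ones from the example above):

import polars as pl

df = pl.read_csv("file_name.csv", infer_schema_length=0)

# replace the offending string value first...
df = df.with_columns(
    [
        pl.when(pl.col("Error Column Name") == "22.4")
        .then(pl.lit("22"))
        .otherwise(pl.col("Error Column Name"))
        .alias("Error Column Name")
    ]
)

# ...then cast the column to the dtype it should have had
df = df.with_columns([pl.col("Error Column Name").cast(pl.Int64)])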

Questions:

  1. Is there an easier way to access the nth column in Polars than what I do in step 2?
  2. Is there a better way of handling the values that cause these errors?
  3. If I read the file and all columns as pl.Utf8 and fix the value causing the error, is there a convenient way to automatically detect the best data types for the df's columns after the data has been read, rather than going through them manually column by column?

Upvotes: 4

Views: 9403

Answers (1)

Dean MacGregor

Reputation: 18691

From the docs, the default infer_schema_length is 100. That means if you have a column whose first 100 values are safely integers, it's going to assign that column the integer data type. When it then hits a float it'll error out rather than go back and reset the data type. Ironically, you changed that default in the wrong direction for the result you wanted.

So, instead of doing pl.read_csv(file="file_name.csv", infer_schema_length=0) and then manually casting values, just do pl.read_csv(file="file_name.csv", infer_schema_length=int(1e10)). If you still get the error, make 1e10 bigger; if it's too slow, make it smaller.
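For instance (1e10 is just an arbitrarily large number, tune it to your file):

import polars as pl

# look at (effectively) the whole file before settling on column dtypes
df = pl.read_csv("file_name.csv", infer_schema_length=int(1e10))

# then check what was inferred for each column
print(df.schema)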

That answers your Q2.

Q1: You don't need to loop through all the values; you can just do df.columns[59] to get column 59. Also, instead of reading in the whole df just to get the column names, you can do pl.read_csv("file_name.csv", n_rows=1), which will take up a much smaller memory footprint.
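Roughly like this:

import polars as pl

# only read the first row: enough to get the header without loading all the data
header_df = pl.read_csv("file_name.csv", n_rows=1)

# get the name of the column at position 59
print(header_df.columns[59])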

Q3: Yes/no. It depends on what you consider convenient and what range of file formats you're opening. If you just want a hierarchy of try int, then try float, otherwise keep utf8, then you could do:

for i in df.columns:
    # try the strictest type first; a failed cast raises, so we fall through
    try:
        df = df.with_column(pl.col(i).cast(pl.Int64()))
        continue
    except Exception:
        pass
    # fall back to float; if that also fails, the column stays as utf8
    try:
        df = df.with_column(pl.col(i).cast(pl.Float64()))
        continue
    except Exception:
        pass

Upvotes: 2
