Wajahat Raza

Reputation: 11

How to update a Polars DataFrame

I want to update a Polars DataFrame.

The Polars syntax/command I used for the purpose: df[0, 'A'] = 'some value'

But the above code gives an error: ValueError: cannot set with list/tuple as value; use a scalar value. I am using polars 0.13.55.

The above code was previously working in polars 0.13.51.

Minimal Code to reproduce the problem:

df = pl.DataFrame( { "IP": ['1.1.1.1', '2.2.2.2'], "ISP" :   
                    ["N/A", "N/A"] } )

isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }

i = 0
for row in df.rows():
    for ip, isp in isp_names.items():
        if row[0] == ip:
            df[i, 'ISP'] = isp  # **This line gives the ValueError**
    i = i + 1

Upvotes: 1

Views: 6174

Answers (2)

jvz

Reputation: 1422

I am unable to reproduce the error on the latest version (0.13.56), so updating polars may help.

May I also suggest two improvements to the code? The second one avoids the issue you ran into altogether.

First, a more Pythonic version:

df = pl.DataFrame( {"IP": ['1.1.1.1', '2.2.2.2'], 
                    "ISP": ["N/A", "N/A"] } )
isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }

for i, row in enumerate(df.rows()):
    df[i, 'ISP'] = isp_names[row[0]]

That is, use enumerate to keep i aligned with row, and instead of looping over isp_names separately, simply look the value up by its key.
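
One small caveat: a direct isp_names[row[0]] lookup raises a KeyError if an IP has no entry in the dictionary. If that can happen in your data, a minimal variant (just a sketch, keeping your column layout) falls back to the existing value with dict.get:

for i, row in enumerate(df.rows()):
    # row[0] is the IP, row[1] is the current ISP; keep it when the IP is unknown
    df[i, 'ISP'] = isp_names.get(row[0], row[1])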

Second, Polars has an excellent expression system, meaning you do not have to pre-allocate the ISP column or write a loop at all:

df = pl.DataFrame( { "IP": ['1.1.1.1', '2.2.2.2']})
isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }
df.with_column(pl.col("IP").map_elements(isp_names.get).alias("ISP"))

which returns df as:

shape: (2, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────┘
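
Note that map_elements is the method name in recent Polars versions. On the 0.13.x releases mentioned in the question, the same mapping was, if I remember correctly, written with the older with_column/apply names (a sketch under that assumption):

df.with_column(pl.col("IP").apply(isp_names.get).alias("ISP"))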

Upvotes: 0

user18559875


It looks as though you might be trying to update the values of a DataFrame, particularly where values are missing (the "N/A" values).

In addition to the advice of @jvz, I would recommend using a left join for this, rather than a dictionary and a for loop. For loops over rows are slow and best avoided; by contrast, a left join is very performant and built for exactly this type of situation.

We'll take this in steps.

First, let's expand your example.

df = pl.DataFrame(
    {"IP": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
     "ISP": ["N/A", "N/A", "PQR", "N/A"]}
)
df
shape: (4, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ N/A │
│ 2.2.2.2 ┆ N/A │
│ 3.3.3.3 ┆ PQR │
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘

Notice that we have three rows with "N/A" values, but one row that already has a valid value, "PQR".

Next, let's convert your dictionary of updated ISP values to a DataFrame, so that we can join the two DataFrames.

isp_df = pl.DataFrame(
    data=[[key, value] for key, value in isp_names.items()],
    schema=["IP", "ISP_updated"],
    orient="row",
)
isp_df
shape: (2, 2)
┌─────────┬─────────────┐
│ IP      ┆ ISP_updated │
│ ---     ┆ ---         │
│ str     ┆ str         │
╞═════════╪═════════════╡
│ 1.1.1.1 ┆ ABC         │
│ 2.2.2.2 ┆ XYZ         │
└─────────┴─────────────┘
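
As an aside, because the dictionary's keys and values are already aligned, the same frame can be built directly from them (an equivalent sketch):

isp_df = pl.DataFrame({"IP": list(isp_names.keys()),
                       "ISP_updated": list(isp_names.values())})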

Now, we simply join the two DataFrames. The how="left" ensures that we keep all rows from df, even if there are no corresponding rows in isp_df.

df.join(isp_df, on="IP", how="left")
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP      ┆ ISP ┆ ISP_updated │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ str ┆ str         │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC         │
│ 2.2.2.2 ┆ N/A ┆ XYZ         │
│ 3.3.3.3 ┆ PQR ┆ null        │
│ 4.4.4.4 ┆ N/A ┆ null        │
└─────────┴─────┴─────────────┘

Notice the null values in ISP_updated. These are cases where you had no updated values for a particular IP value.

To complete the process, we use fill_null to copy the values from the ISP column into the ISP_updated column for those cases where isp_df had no updates for a particular IP value.

(
    df
    .join(isp_df, on="IP", how="left")
    .with_columns(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
)
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP      ┆ ISP ┆ ISP_updated │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ str ┆ str         │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC         │
│ 2.2.2.2 ┆ N/A ┆ XYZ         │
│ 3.3.3.3 ┆ PQR ┆ PQR         │
│ 4.4.4.4 ┆ N/A ┆ N/A         │
└─────────┴─────┴─────────────┘
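
An equivalent way to express this step, if you prefer it, is pl.coalesce, which takes the first non-null value across the listed columns (a sketch that produces the same result):

(
    df
    .join(isp_df, on="IP", how="left")
    .with_columns(
        pl.coalesce(["ISP_updated", "ISP"]).alias("ISP_updated")
    )
)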

Now, your ISP_updated column has the updated value for each IP. If you want, you can drop and rename columns so that your final column is labeled ISP.

(
    df
    .join(isp_df, on="IP", how="left")
    .with_columns(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
    .drop("ISP")
    .rename({"ISP_updated": "ISP"})
)
shape: (4, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
│ 2.2.2.2 ┆ XYZ │
│ 3.3.3.3 ┆ PQR │
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘

As the size of your DataFrames gets large, you will definitely want to avoid using for loops. Using join will be far faster.
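
If the data grows really large, you can also run the same pipeline through the lazy API, so that Polars can optimize the query before executing it (a sketch of the steps above, assuming the same df and isp_df):

result = (
    df.lazy()
    .join(isp_df.lazy(), on="IP", how="left")
    .with_columns(pl.col("ISP_updated").fill_null(pl.col("ISP")))
    .drop("ISP")
    .rename({"ISP_updated": "ISP"})
    .collect()
)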

Upvotes: 2
