Reputation: 11
I want to update a polars library dataframe,
polars syntax/command which I used for the purpose: df[0, 'A'] = 'some value'
but the above code gives an error: ValueError: cannot set with list/tuple as value; use a scalar value I am using polars 0.13.55
The above coode was previously working in polars 0.13.51
Minimal Code to reproduce the problem:
df = pl.DataFrame( { "IP": ['1.1.1.1', '2.2.2.2'], "ISP" :
["N/A", "N/A"] } )
isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }
i = 0
for row in df.rows():
for ip, isp in isp_names.items():
if(row[0] == ip):
df[i, 'ISP'] = isp #**This line gives the Value error**
i = i + 1
Upvotes: 1
Views: 6174
Reputation: 1422
I am unable to reproduce the error on the latest version (0.13.56), so updating polars may help.
May I also suggest two improvements to the code, where the second improvements avoids the issue you run into altogether?
First, a more Pythonic version:
df = pl.DataFrame( {"IP": ['1.1.1.1', '2.2.2.2'],
"ISP": ["N/A", "N/A"] } )
isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }
for i, row in enumerate(df.rows()):
df[i, 'ISP'] = isp_names[row[0]]
I.e., use enumerate
to keep your i
aligned with row
, and do not loop isp_names
separately but simply get the value by the key.
Second, Polars has an excellent expression system, meaning you do not have to pre-allocate ISP
column or write a loop:
df = pl.DataFrame( { "IP": ['1.1.1.1', '2.2.2.2']})
isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }
df.with_column(pl.col("IP").map_elements(isp_names.get).alias("ISP"))
which returns df
as:
shape: (2, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────┘
Upvotes: 0
Reputation:
It looks as though you might be trying to update the values of DataFrame, particularly where values are missing (the "N/A" values).
In addition the advice of @jvz, I would recommend using a left join
for your purposes, rather than using a dictionary and a for loop. Using for loops is very slow, and is to be avoided. By contrast, a left join
will be very performant, and is built for exactly these types of situations.
We'll take this in steps.
First, let's first expand your example.
df = pl.DataFrame(
{"IP": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
"ISP": ["N/A", "N/A", "PQR", "N/A"]}
)
df
shape: (4, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ N/A │
│ 2.2.2.2 ┆ N/A │
│ 3.3.3.3 ┆ PQR │
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
Notice that we have three rows with "N/A" values, but one row that already has a valid value, "PQR".
Next, let's convert your dictionary of updated ISP values to a DataFrame, so that we can join the two DataFrames.
isp_df = pl.DataFrame(
data=[[key, value] for key, value in isp_names.items()],
schema=["IP", "ISP_updated"],
orient="row",
)
isp_df
shape: (2, 2)
┌─────────┬─────────────┐
│ IP ┆ ISP_updated │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════════════╡
│ 1.1.1.1 ┆ ABC │
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────────────┘
Now, we simply join
the two DataFrames. The how="left"
ensures that we keep all rows from df
, even if there are no corresponding rows in isp_df
.
df.join(isp_df, on="IP", how="left")
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP ┆ ISP ┆ ISP_updated │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC │
│ 2.2.2.2 ┆ N/A ┆ XYZ │
│ 3.3.3.3 ┆ PQR ┆ null │
│ 4.4.4.4 ┆ N/A ┆ null │
└─────────┴─────┴─────────────┘
Notice the null
values in ISP_updated
. These are cases where you had no updated values for a particular IP
value.
To complete the process, we use fill_null
to copy the values from the ISP
column into the ISP_updated
column for those cases where isp_df
had no updates for a particular IP
value.
(
df
.join(isp_df, on="IP", how="left")
.with_columns(
pl.col("ISP_updated").fill_null(pl.col("ISP"))
)
)
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP ┆ ISP ┆ ISP_updated │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC │
│ 2.2.2.2 ┆ N/A ┆ XYZ │
│ 3.3.3.3 ┆ PQR ┆ PQR │
│ 4.4.4.4 ┆ N/A ┆ N/A │
└─────────┴─────┴─────────────┘
Now, your ISP_updated
column has the updated values for each ISP. If you want, you can drop and rename columns so that your final column is labeled ISP
.
(
df
.join(isp_df, on="IP", how="left")
.with_columns(
pl.col("ISP_updated").fill_null(pl.col("ISP"))
)
.drop("ISP")
.rename({"ISP_updated": "ISP"})
)
shape: (4, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
│ 2.2.2.2 ┆ XYZ │
│ 3.3.3.3 ┆ PQR │
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
As the size of your DataFrames gets large, you will definitely want to avoid using for loops. Using join
will be far faster.
Upvotes: 2