Alk90
Alk90

Reputation: 87

Pandas isin function in polars

Once in a while I get to the point where I need to run the following line:

DF['is_flagged'] = DF['id'].isin(DF2[DF2['flag']==1]['id'])

Lately I started using polars, and I wonder how to convert it easily to polars.

For example:

df1 = pl.DataFrame({
'Animal_id': [1, 2, 3, 4, 5, 6, 7],     
'age': [4, 6, 3, 8, 3, 8, 9] })

df2 = pl.DataFrame({     
'Animal_id': [1, 2, 3, 4, 5, 6, 7],     
'Animal_type': ['cat', 'dog', 'cat', 'cat', 'dog', 'dog', 'cat'] })

Expected output:

shape: (7, 3)
┌───────────┬─────┬────────┐
│ animal_id ┆ age ┆ is_dog │
│ ---       ┆ --- ┆ ---    │
│ i64       ┆ i64 ┆ i64    │
╞═══════════╪═════╪════════╡
│ 1         ┆ 4   ┆ 0      │
│ 2         ┆ 6   ┆ 1      │
│ 3         ┆ 3   ┆ 0      │
│ 4         ┆ 8   ┆ 0      │
│ 5         ┆ 3   ┆ 1      │
│ 6         ┆ 8   ┆ 1      │
│ 7         ┆ 9   ┆ 0      │
└───────────┴─────┴────────┘

Without using flag and then join

I tried to use the is_in() function but this didn’t worked.

Upvotes: 5

Views: 7360

Answers (2)

ignoring_gravity
ignoring_gravity

Reputation: 10476

How about

df1.with_columns(
    df2.with_columns(
        pl.col("Animal_type")
        .is_in(["dog"])
        .cast(int)
        .alias("is_dog")
    )["is_dog"]
)

This gives you:

Out[13]:
shape: (7, 3)
┌───────────┬─────┬────────┐
│ Animal_id ┆ age ┆ is_dog │
│ ---       ┆ --- ┆ ---    │
│ i64       ┆ i64 ┆ i64    │
╞═══════════╪═════╪════════╡
│ 1         ┆ 4   ┆ 0      │
│ 2         ┆ 6   ┆ 1      │
│ 3         ┆ 3   ┆ 0      │
│ 4         ┆ 8   ┆ 0      │
│ 5         ┆ 3   ┆ 1      │
│ 6         ┆ 8   ┆ 1      │
│ 7         ┆ 9   ┆ 0      │
└───────────┴─────┴────────┘

Upvotes: 1

alexp
alexp

Reputation: 1054

So to be honest I am not quite sure from your question, your pandas snippet and your example what your desired solution is, but here are my three takes.

import polars as pl

df1 = pl.DataFrame(
    {"Animal_id": [1, 2, 3, 4, 5, 6, 7], "age": [4, 6, 3, 8, 3, 8, 9]}
).lazy()

df2 = pl.DataFrame(
    {
        "Animal_id": [1, 2, 3, 4, 5, 6, 7],
        "Animal_type": ["cat", "dog", "cat", "cat", "dog", "dog", "cat"],
    }
).lazy()

1 Solution

So this one is a small adaptation of the solution of @ignoring_gravity. So the assumption in his solution is that the DataFrames have the same length and the Animal_id matches in both tables. If that was your goal I want to give you another solution because by subsetting (["is_dog"]) you lose the possibility to use the lazy api.

df1.with_context(
    df2.select(pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog"))
).select(["Animal_id", "age", "is_dog"]).collect()

2 Solution

So in case you want something more similar to your pandas snippet and because you wrote you don't want to have a join.

df1.with_context(
    df2.filter(pl.col("Animal_type") == "dog").select(
        pl.col("Animal_id").alias("valid_ids")
    )
).with_columns(
    [pl.col("Animal_id").is_in(pl.col("valid_ids")).cast(int).alias("is_dog")]
).collect()

3 Solution

So this would be the solution with a join. In my opinion the best solution regarding your example and example output, but maybe there are other reasons that speak against a join, which aren't apparent from your example.

df1.join(
    df2.select(
        ["Animal_id", pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog")]
    ),
    on=["Animal_id"],
).collect()

Upvotes: 3

Related Questions