Reputation: 67
I would like to apply a specific function to specific columns using Polars, similar to the following question:
The approach in that question works with pandas, but it takes ages to run on my computer, so I would like to use Polars instead. Taking from that question:
df = pd.DataFrame({'source': ['Paul', 'Paul'],
                   'target': ['GOOGLE', 'Ferrari'],
                   'edge': ['works at', 'drive']
                   })
source target edge
0 Paul GOOGLE works at
1 Paul Ferrari drive
Expected outcome with polars:
source target edge Entity
0 Paul GOOGLE works at Person
1 Paul Ferrari drive Person
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
df['Entities'] = df['source'].apply(lambda sent: [ent.label_ for ent in nlp(sent).ents])
df['Entities'][1]
How can I add a column with the label (Person) to the current dataframe with Polars? Thank you.
Upvotes: 0
Views: 450
Reputation: 1884
You can run the apply in Polars with the following code:
df_pl.with_columns(
    entities = pl.col('target').map_elements(
        lambda sent: [ent.label_ for ent in nlp(sent).ents]
    )
)
As @jqurious mentioned, this should not be expected to be faster than pandas; in a couple of tests it took about the same time.
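A rough mental model of why (a pure-Python sketch, not actual Polars internals; `fake_nlp` is a hypothetical stand-in for the spaCy call): a Python UDF passed to `map_elements` or to pandas' `apply` is invoked once per row from the interpreter, so both reduce to roughly the same loop:

```python
# Sketch: what a Python-UDF "apply" boils down to in either library.
def fake_nlp(text):
    return text.upper()  # placeholder for the expensive nlp(text) call

column = ["GOOGLE", "Ferrari"]

# Both pandas .apply and polars .map_elements do roughly this per row:
result = [fake_nlp(value) for value in column]
print(result)  # ['GOOGLE', 'FERRARI']
```

Since the per-row Python call dominates, switching libraries alone does not change the runtime.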
In addition to the comments by @jqurious, you could reduce the number of times the apply function is called if some values are repeated.
You can do that by redefining the function with lru_cache:
from functools import lru_cache
import spacy
import polars as pl
nlp = spacy.load('en_core_web_sm')
@lru_cache(1024)
def cached_nlp(text):
    return nlp(text)

df_pl.with_columns(
    entities = pl.col('target').map_elements(
        lambda sent: [ent.label_ for ent in cached_nlp(sent).ents]
    )
)
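To see the effect of the cache without loading spaCy, here is a minimal sketch; `cached_fn` is a dummy stand-in for `cached_nlp`. Repeated inputs are served from the cache, which you can confirm with `cache_info()`:

```python
from functools import lru_cache

calls = 0

@lru_cache(1024)
def cached_fn(text):
    # Stand-in for the expensive nlp(text) call.
    global calls
    calls += 1
    return text.lower()

# Four lookups, but only two distinct inputs -> two real calls.
for value in ["Paul", "Paul", "Ferrari", "Paul"]:
    cached_fn(value)

print(calls)                        # 2
print(cached_fn.cache_info().hits)  # 2
```

The more duplicated the column values, the bigger the saving.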
Upvotes: 2
Reputation: 881
Building slightly on @Luca's answer, you can add the caching one level up to avoid the additional list comprehension and jump straight to the list of entity labels:
Polars syntax shown, but equally applicable to pandas:
from functools import lru_cache
@lru_cache(2048)  # << size appropriately
def entity_labels(s: str) -> list:
    return [ent.label_ for ent in nlp(s).ents]

df.with_columns(
    pl.col("target").map_elements(
        function = entity_labels,
        return_dtype = pl.List(pl.String),
    ).alias("labels")
)
Upvotes: 3
Reputation: 21164
Polars will suffer from the same issue as pandas in this case.
Using .apply
means you're essentially running a Python for loop.
You can attempt to run the UDF (user-defined function) in parallel with a multiprocessing Pool.
Depending on the particular function/dataset, it may or may not offer a speedup, as multiprocessing has its own cost - it has to be measured on a case-by-case basis.
In this case if I expand your example to 10_000 rows - it runs 4x faster.
import spacy
import polars as pl
from functools import partial
from multiprocessing import cpu_count, get_context

def recognize(source, nlp):
    return [ent.label_ for ent in nlp(source).ents]

if __name__ == "__main__":
    nlp = spacy.load("en_core_web_sm")
    df = pl.DataFrame({
        "source": ["Paul", "Paul"],
        "target": ["GOOGLE", "Ferrari"],
        "edge": ["works at", "drive"]
    })
    func = partial(recognize, nlp=nlp)  # used to pass in `nlp`
    n_workers = cpu_count() // 2        # experiment with this value
    with get_context("spawn").Pool(n_workers) as pool:
        df = df.with_columns(Entity = pl.Series(pool.map(func, df.get_column("source"))))
    print(df)
shape: (2, 4)
┌────────┬─────────┬──────────┬────────────┐
│ source ┆ target  ┆ edge     ┆ Entity     │
│ ---    ┆ ---     ┆ ---      ┆ ---        │
│ str    ┆ str     ┆ str      ┆ list[str]  │
╞════════╪═════════╪══════════╪════════════╡
│ Paul   ┆ GOOGLE  ┆ works at ┆ ["PERSON"] │
│ Paul   ┆ Ferrari ┆ drive    ┆ ["PERSON"] │
└────────┴─────────┴──────────┴────────────┘
Upvotes: 1