Reputation: 93
I have a (multi-valued) text field that I have to clean: remove stopwords, count each word's occurrences, and store the result in a new dataframe whose grain is the word and row_key.
From that dataframe I need to create one record for each word and row_key.
Right now I am using:
def function(df):
    global f_target_table
    global row_key
    global column_name
    for word in df.words:
        f_target_table.loc[row_key] = pd.Series({
            'row_key': row_key,
            'column_name': column_name,
            'source_id': df.source_id,
            'table_row_key': df.row_key,
            'keyword': word,
            'occurrence': df['cleaned_words'][word],
            'etl_run_number': df.etl_run_number,
            'soft_deleted_flag': df.soft_deleted_flag,
            'table_name': 'abc'})
        row_key += 1

df_main.swifter.allow_dask_on_strings().apply(function, axis=1)
My current logic is O(n²); I would like to reduce the time complexity of this snippet.
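For context on where the quadratic cost comes from: appending row by row with f_target_table.loc[row_key] = ... can reallocate the whole frame on each assignment. Collecting plain dicts in a list and building the frame once is linear; a minimal sketch with made-up stand-in data and only the essential columns:

```python
import pandas as pd

# Toy stand-ins for the real inputs (names and values are illustrative)
rows = [
    {"row_key": 1, "cleaned_words": {"apple": 2, "pear": 1}},
    {"row_key": 2, "cleaned_words": {"car": 1}},
]

# Accumulate plain dicts, then build the frame once: O(n)
records = []
for row in rows:
    for word, count in row["cleaned_words"].items():
        records.append({"table_row_key": row["row_key"],
                        "keyword": word,
                        "occurrence": count})

f_target_table = pd.DataFrame(records)
```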
The required output:
Upvotes: 0
Views: 102
Reputation: 13447
Not sure if this can help, but have you tried something like:
import pandas as pd
from collections import Counter
from string import punctuation

def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

df = pd.DataFrame({"text": ["Apple, pear, peach, apple.",
                            "Car, bus, train, plane."],
                   "row_key": [4, 5]})

df["words"] = df["text"].map(strip_punctuation)\
                        .str.lower()\
                        .str.split(" ")

def fun(x):
    s = pd.Series(Counter(x["words"]))
    s = s.reset_index(name="occurrence")\
         .rename(columns={"index": "keyword"})
    s["table_row_key"] = x["row_key"]
    return s

out = df.apply(fun, axis=1)
out = pd.concat(out.values, ignore_index=True)
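For larger frames, a fully vectorized alternative is a sketch worth considering (this is my variation, not part of the answer above): explode each word list into one row per word, then count duplicates with groupby, avoiding the Python-level per-row function entirely.

```python
import re
import pandas as pd
from string import punctuation

# Same toy data as above
df = pd.DataFrame({"text": ["Apple, pear, peach, apple.",
                            "Car, bus, train, plane."],
                   "row_key": [4, 5]})

# Strip punctuation (re.escape keeps the character class valid),
# lowercase, and split into word lists
df["words"] = (df["text"]
               .str.replace(f"[{re.escape(punctuation)}]", "", regex=True)
               .str.lower()
               .str.split(" "))

# One row per (row_key, word), then count duplicates
out = (df.explode("words")
         .groupby(["row_key", "words"])
         .size()
         .reset_index(name="occurrence")
         .rename(columns={"words": "keyword",
                          "row_key": "table_row_key"}))
```

This yields one record per (table_row_key, keyword) pair with its occurrence count, the same grain as the apply-based version.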
UPDATE

The function fun takes a row as input, so for every row in df, fun is fed something like x = df.iloc[0]. In this case x["words"] is

['apple', 'pear', 'peach', 'apple']

and Counter(x["words"]) returns

Counter({'apple': 2, 'pear': 1, 'peach': 1})

which is basically a dict. Then

s = pd.Series(Counter(x["words"]))

returns a series with the keys as index and the counts as values:

apple    2
pear     1
peach    1
dtype: int64

To turn s into a dataframe we should reset_index, so s.reset_index() returns:

   index  0
0  apple  2
1   pear  1
2  peach  1

We want to rename index to keyword and 0 to occurrence; to do so, the following commands are equivalent:

s = s.reset_index()\
     .rename(columns={"index": "keyword",
                      0: "occurrence"})

or, alternatively:

s = s.reset_index(name="occurrence")\
     .rename(columns={"index": "keyword"})

Finally, we attach the row_key to this series:

s["table_row_key"] = x["row_key"]

When you use fun in apply you get a really ugly looking series:

out = df.apply(fun, axis=1)

0    keyword  occurrence  table_row_key
0      apple...
1    keyword  occurrence  table_row_key
0        car...
dtype: object

where every value is actually a pd.DataFrame. Concatenating the values gives you the desired output.
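The question also mentions stopword removal, which the snippet above skips. A minimal sketch of filtering before counting; the STOPWORDS set here is a hand-rolled example (in practice you might use a curated list such as nltk's):

```python
import pandas as pd
from collections import Counter

# Hypothetical, hand-rolled stopword set for illustration only
STOPWORDS = {"the", "a", "an", "and", "or"}

words = ["the", "apple", "and", "the", "pear"]

# Drop stopwords before counting occurrences
filtered = [w for w in words if w not in STOPWORDS]
counts = pd.Series(Counter(filtered))
```

The same filter can be applied inside fun, right before building the Counter.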
I see you are aware of dask; in case you want to use it, the code is mostly the same:
import dask.dataframe as dd

# Here you are reading from file
df = pd.DataFrame({"text": ["Apple, pear, peach, apple.",
                            "Car, bus, train, plane."],
                   "row_key": [4, 5]})
df = dd.from_pandas(df, npartitions=2)

# The juicy part
df["words"] = df["text"].map(strip_punctuation)\
                        .str.lower()\
                        .str.split(" ")

# Here is the only difference: with dask apply
# you have to specify your meta
out = df.apply(fun, axis=1, meta=pd.DataFrame)

# You need to compute, as dask is lazy
out = out.compute()
In case your data is spread over several partitions, it could be better (faster) to use:
out = df.map_partitions(lambda x: x.apply(fun, axis=1))
out = out.compute()
Upvotes: 1