koushikkirugulige

Reputation: 93

How to count the words in each record and store them at the word and key level in the best possible time?

I have multiple text fields that I have to clean: remove stopwords, count each word's occurrences, and store the counts in a new dataframe whose grain is the word and row_key.

[image: the source dataframe]

From the above dataframe I need to create a record for each word and row_key. Right now I am using

def function(df):
    global f_target_table
    global row_key
    global column_name
    # one output record per word in this row
    for word in df.words:
        f_target_table.loc[row_key] = pd.Series(
            {'row_key': row_key, 'column_name': column_name,
             'source_id': df.source_id, 'table_row_key': df.row_key,
             'keyword': word, 'occurrence': df['cleaned_words'][word],
             'etl_run_number': df.etl_run_number,
             'soft_deleted_flag': df.soft_deleted_flag,
             'table_name': 'abc'})
        row_key += 1

df_main.swifter.allow_dask_on_strings().apply(function, axis=1)

My current logic is O(n^2); I would like to reduce the time complexity of this snippet.

[image: the required output]
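A likely cost driver in the snippet above is growing f_target_table with repeated .loc assignments, which can reallocate the frame on every insert. A minimal sketch of the same loop, reusing the names from the snippet above plus a new rows list, that builds the table once at the end:

rows = []

def function(df):
    global row_key
    for word in df.words:
        # plain dict per output record; no DataFrame reallocation inside the loop
        rows.append({'row_key': row_key, 'column_name': column_name,
                     'source_id': df.source_id, 'table_row_key': df.row_key,
                     'keyword': word, 'occurrence': df['cleaned_words'][word],
                     'etl_run_number': df.etl_run_number,
                     'soft_deleted_flag': df.soft_deleted_flag,
                     'table_name': 'abc'})
        row_key += 1

df_main.apply(function, axis=1)
f_target_table = pd.DataFrame(rows)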

Upvotes: 0

Views: 102

Answers (1)

rpanai

Reputation: 13447

Not sure if this can help, but have you tried something like

import pandas as pd
from collections import Counter
from string import punctuation

def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

df = pd.DataFrame({"text":["Apple, pear, peach, apple.",
                           "Car, bus, train, plane."],
                   "row_key":[4,5]})

df["words"] = df["text"].map(strip_punctuation)\
                        .str.lower()\
                        .str.split(" ")

def fun(x):
    s = pd.Series(Counter(x["words"]))
    s = s.reset_index(name="occurrence")\
         .rename(columns={"index":"keyword"})
    s["table_row_key"] = x["row_key"]
    return s

out = df.apply(fun, axis=1)
out = pd.concat(out.values, ignore_index=True)
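For the two sample rows this gives:

  keyword  occurrence  table_row_key
0   apple           2              4
1    pear           1              4
2   peach           1              4
3     car           1              5
4     bus           1              5
5   train           1              5
6   plane           1              5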

UPDATE

Explanation

The function fun takes a row as input. So for every row in df, fun is fed something like x = df.iloc[0]. In this case

  • x["words"] is ['apple', 'pear', 'peach', 'apple'].
  • Counter(x["words"]) returns Counter({'apple': 2, 'pear': 1, 'peach': 1}) which is basically a dict
  • s = pd.Series(Counter(x["words"])) returns a series with keys as index and values as values
apple    2
pear     1
peach    1
dtype: int64
  • To transform s into a dataframe we should reset_index. So s.reset_index() returns
   index  0
0  apple  2
1   pear  1
2  peach  1
  • Here we would like to rename index to keyword and 0 to occurrence; to do so, the following commands are equivalent
s = s.reset_index()\
     .rename(columns={"index":"keyword",
                      0:"occurrence"})

or, alternatively

s = s.reset_index(name="occurrence")\
     .rename(columns={"index":"keyword"})
  • Finally you want to add the row_key to this dataframe
s["table_row_key"] = x["row_key"]

When you use fun in apply you get a really ugly-looking series

out = df.apply(fun, axis=1)

0      keyword  occurrence  table_row_key
0   apple...
1      keyword  occurrence  table_row_key
0     car...
dtype: object

where actually every value is a pd.DataFrame. Concatenating the values you get your desired output.
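As a side note, a fully vectorized sketch that avoids apply altogether, assuming pandas >= 0.25 for explode, would be

# explode turns each word list into one row per word,
# then groupby counts duplicates within each original row
out2 = (df.explode("words")
          .groupby(["row_key", "words"], sort=False)
          .size()
          .reset_index(name="occurrence")
          .rename(columns={"words": "keyword",
                           "row_key": "table_row_key"}))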

Extra

I see you are aware of dask; in case you want to use it, the code is mostly the same

import dask.dataframe as dd

# Here you are reading from file
df = pd.DataFrame({"text":["Apple, pear, peach, apple.",
                           "Car, bus, train, plane."],
                   "row_key":[4,5]})
df = dd.from_pandas(df, npartitions=2)
# Juicy
df["words"] = df["text"].map(strip_punctuation)\
                        .str.lower()\
                        .str.split(" ")

# here is the only difference as 
# with dask apply you have to specify your meta
out = df.apply(fun, axis=1, meta=pd.DataFrame)
# You need to compute as dask is lazy
out = out.compute()

In case you have your data in several partitions it could be better (faster) to use

out = df.map_partitions(lambda x: x.apply(fun, axis=1))
out = out.compute()
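
As with apply, dask otherwise infers the meta by running fun on sample data. A sketch, assuming a recent dask version, that passes meta explicitly and then concatenates as in the pandas case:

# meta as a (name, dtype) tuple: each partition-level apply returns an object Series
out = df.map_partitions(lambda part: part.apply(fun, axis=1),
                        meta=("fun", "object"))
# concatenate the per-row frames, as in the pandas version above
out = pd.concat(out.compute().values, ignore_index=True)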

Upvotes: 1
