Reputation: 723
I have a word list (a Python list) containing a large number of English words.
I also have a DataFrame that looks like this:
    FileName      PageNo    LineNo  GOODS_DESC
1   17743633 - 1  TM000002  69      Abuj Cen Le
31  17743633 - 1  TM000007  126     Mr USD
33  17743633 - 1  TM000008  22      TABLEAU EMBALLAGE
34  17743633 - 1  TM000008  24      LISA e EMBALV
46  17743633 - 1  TM000008  143     Cen
47  17743633 - 1  TM000008  146     A Gl
50  17743633 - 1  TM000009  121     Ppvv Tn Ppvv In
51  17743633 - 1  TM000009  129     SPECIFY
52  17743633 - 1  TM000009  136     Decrp G
58  17743633 - 1  TM000009  97      Je ugn
60  17743633 - 1  TM000009  108     De Veel
61  17743633 - 1  TM000014  44      TYRE CHIPS SHREDDED TYRES
63  17743633 - 1  TM000014  48      TYRE CHIPS SHREDDED TYRES
I want to keep only those words in the 'GOODS_DESC' column that are present in the word list.
My desired output is:
    FileName      PageNo    LineNo  GOODS_DESC
1   17743633 - 1  TM000002  69      NaN
31  17743633 - 1  TM000007  126     Mr USD
33  17743633 - 1  TM000008  22      TABLEAU
34  17743633 - 1  TM000008  24      LISA
46  17743633 - 1  TM000008  143     NaN
47  17743633 - 1  TM000008  146     NaN
50  17743633 - 1  TM000009  121     NaN
51  17743633 - 1  TM000009  129     SPECIFY
52  17743633 - 1  TM000009  136     NaN
58  17743633 - 1  TM000009  97      NaN
60  17743633 - 1  TM000009  108     NaN
61  17743633 - 1  TM000014  44      TYRE CHIPS SHREDDED TYRES
63  17743633 - 1  TM000014  48      TYRE CHIPS SHREDDED TYRES
My approach below produces the output, but it loops over the rows with lists and is slow. I want to make it faster.
for rows in df.itertuples():
    a = []
    flat_list = []
    a.append(rows.GOODS_DESC)
    flat_list = [item.strip() for sublist in a for item in sublist.split(' ') if item.strip()]
    flat_list = list(sorted(set(flat_list), key=flat_list.index))
    flat_list = [i for i in flat_list if i.lower() in word_list]
    if not flat_list:
        df.drop(rows.Index, inplace=True)
        continue
    s = ' '.join(flat_list)
    df.loc[rows.Index, 'GOODS_DESC'] = s
df['GOODS_DESC'] = df['GOODS_DESC'].str.upper()
Upvotes: 4
Views: 122
Reputation: 29635
You can use merge and join to do what you want. First, some preparation:
# input DataFrame; it can have any number of columns
df_input = pd.DataFrame({'col1': range(1, 5),
                         'GOODS_DESC': ['Abuj Cen Le', 'Mr USD',
                                        'TABLEAU EMBALLAGE', 'LISA e EMBALV']})
# DataFrame built from the list of words
df_word = pd.DataFrame({'Word': [word.lower() for word in word_list]})
# create df_stack with each word of your sentences as a row, keeping the indexes for the join later
df_stack = df_input['GOODS_DESC'].str.split(' ', expand=True).stack().reset_index()
# column with the same name as in df_word, lowercased for the merge
df_stack['Word'] = df_stack[0].str.lower()
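For the sample df_input above, df_stack ends up with one row per word, where level_0 points back to the original row index, level_1 is the word position, and column 0 holds the word itself; it should look roughly like this:
print(df_stack)
#    level_0  level_1          0       Word
# 0        0        0       Abuj       abuj
# 1        0        1        Cen        cen
# 2        0        2         Le         le
# 3        1        0         Mr         mr
# 4        1        1        USD        usd
# 5        2        0    TABLEAU    tableau
# 6        2        1  EMBALLAGE  emballage
# 7        3        0       LISA       lisa
# 8        3        1          e          e
# 9        3        2     EMBALV     embalv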
Now you can use merge and then join:
df_join = df_stack.merge(df_word).groupby('level_0')[0].apply(lambda x: ' '.join(x))
df_output = df_input.join(df_join)
# get rid of the original column and rename the joined one
df_output = df_output.drop('GOODS_DESC', axis=1).rename(columns={0: 'GOODS_DESC'})
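Assuming word_list behaves like the one implied by the question (it contains 'mr', 'usd', 'tableau' and 'lisa' but none of the other tokens), df_output should come out roughly as:
print(df_output)
#    col1 GOODS_DESC
# 0     1        NaN
# 1     2     Mr USD
# 2     3    TABLEAU
# 3     4       LISA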
It looks like a lot of lines, but merge and join are efficient, so I hope it will be fast enough.
Upvotes: 1
Reputation: 164773
Your logic seems overly complicated. You can use a single list comprehension with pd.Series.apply. I recommend, as below, using a set for O(1) lookups and str.casefold to match strings irrespective of case.
import numpy as np
import pandas as pd

s = pd.Series(['Abuj Cen Le', 'Mr USD', 'TABLEAU EMBALLAGE', 'LISA e EMBALV'])
word_set = {i.casefold() for i in ['Mr', 'USD', 'TABLEAU', 'LISA']}

def apply_filter(x):
    out = ' '.join([i for i in x.split() if i.casefold() in word_set])
    return out if out else np.nan

res = s.apply(apply_filter)
print(res)
0 NaN
1 Mr USD
2 TABLEAU
3 LISA
dtype: object
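Applied to the DataFrame from the question, the same idea would look roughly like this (word_list and df are your own objects, and the dropna step is only needed if you want to discard unmatched rows, as your loop does):
# build the lookup set once from your word_list
word_set = {w.casefold() for w in word_list}

df['GOODS_DESC'] = df['GOODS_DESC'].apply(apply_filter)
# optionally drop the rows that ended up as NaN, mirroring your df.drop(...)
df = df.dropna(subset=['GOODS_DESC'])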
Upvotes: 1