I have a data set that looks something like this:
df['pos_tag']
0 [(colin, NN), (eats, VB), (cake, NN)]
1 [(paris, NN), (kicks, VB), (ball, NN)]
2 [(jackson, NN), (watches, VB), (television, NN)]
3 [(joyce, NN), (drinks, VB), (water, NN)]
4 [(oscar, NN), (wins, VB), (award, NN)]
I want to write a function to count the occurrences of each part of speech:
def count_pos_tag(dfcol):
    values = []
    for row in dfcol:
        count = [0, 0]
        for token, tag in row:
            if tag.startswith('NN'):
                count[0] += 1
            elif tag.startswith('VB'):
                count[1] += 1
        values.append(count)
    return values
values = count_pos_tag(df['pos_tag'])
I have noticed that this takes quite some time, as I am running it on a big data set. Is there a faster way to do this?
You need to re-think your organization. pandas is meant for 2D arrays of simple data (i.e. scalars like int, str, datetime64[ns]), not complex objects like lists, tuples, dicts, or in this case a list of tuples.
Once we reshape the data into that simpler organization, all you need is groupby + value_counts to get the counts per part of speech per row of the original DataFrame. The key here is that the reshaped DataFrame has each word and pos tag in its own cell, and its index is no longer unique but points back to the original index.
import pandas as pd

df = pd.DataFrame({'pos_tag': [[('colin', 'NN'), ('eats', 'VB'), ('cake', 'NN')],
                               [('paris', 'NN'), ('kicks', 'VB'), ('ball', 'NN')],
                               [('jackson', 'NN'), ('watches', 'VB'), ('television', 'NN')],
                               [('joyce', 'NN'), ('drinks', 'VB'), ('water', 'NN')],
                               [('oscar', 'NN'), ('wins', 'VB'), ('award', 'NN')]]})
# explode: one row per (word, tag) tuple, with a repeated index pointing back to the original row
s = df['pos_tag'].explode()
# split the tuples into 'word' and 'pos' columns, keeping that repeated index
df1 = pd.DataFrame(s.to_list(), index=s.index, columns=['word', 'pos'])
#      word pos
# 0   colin  NN
# 0    eats  VB
# 0    cake  NN
# 1   paris  NN
# ..    ...  ..
# 3   water  NN
# 4   oscar  NN
# 4    wins  VB
# 4   award  NN
df1.groupby(level=0).pos.value_counts()
   pos
0  NN     2
   VB     1
1  NN     2
   VB     1
2  NN     2
   VB     1
3  NN     2
   VB     1
4  NN     2
   VB     1
Name: pos, dtype: int64
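If you want the result in the same shape your original loop produced (one [NN, VB] pair per row), here is a minimal sketch building on the above that uses unstack to pivot the counts into columns. The names counts and values are just illustrative, and note that unlike the startswith check in your function, this counts each exact tag separately:
# pivot the per-row tag counts into one column per tag; rows missing a tag get 0
counts = df1.groupby(level=0).pos.value_counts().unstack(fill_value=0)
#    NN  VB
# 0   2   1
# 1   2   1
# ...

# same output shape as the original count_pos_tag: a list of [NN, VB] pairs
values = counts[['NN', 'VB']].to_numpy().tolist()
# [[2, 1], [2, 1], [2, 1], [2, 1], [2, 1]]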