Reputation: 761
I need some tips to make a calculation.
I have a DataFrame that looks like the following:
text_id  user   date        important_words
1        John   2018-01-01  {cat, dog, puppy}
1        John   2018-02-01  {cat, dog}
2        Anne   2018-01-01  {flower, sun}
3        John   2018-03-01  {water, blue}
3        Marie  2018-05-01  {water, blue, ocean}
3        Kate   2018-08-01  {island, sand, towel}
4        Max    2018-01-01  {hot, cold}
4        Ethan  2018-06-01  {hot, warm}
5        Marie  2019-01-01  {boo}
In the given dataframe, the text_id column refers to the id of a text: each text with a different id is a different text. The user column refers to the name of the user who edited the text (adding and erasing important words). The date column refers to the moment at which the edit was made (note that the edits on each text are chronologically sorted). Finally, the important_words column is the set of important words present in the text after the user's edit.
I need to calculate how many words were added by each user in each edit of a page.
The expected output here would be:
text_id  user   date        important_words        added_words
1        John   2018-01-01  {cat, dog, puppy}      3
1        John   2018-02-01  {cat, dog}             0
2        Anne   2018-01-01  {flower, sun}          2
3        John   2018-03-01  {water, blue}          2
3        Marie  2018-05-01  {water, blue, ocean}   1
3        Kate   2018-08-01  {island, sand, towel}  3
4        Max    2018-01-01  {hot, cold}            2
4        Ethan  2018-06-01  {hot, warm}            1
5        Marie  2019-01-01  {boo}                  1
Note that the first edit of a text is its creation, so in that case the number of added words is always the size of the important_words set.
Any tips on the fastest way to compute the added_words column would be highly appreciated.
Note that the important_words column contains a set, so the difference between two consecutive edits should be easy to compute.
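For reference, the sample frame can be reconstructed as follows (assuming important_words really holds Python set objects; df is my name for it):
import pandas as pd

df = pd.DataFrame({
    'text_id': [1, 1, 2, 3, 3, 3, 4, 4, 5],
    'user': ['John', 'John', 'Anne', 'John', 'Marie', 'Kate', 'Max', 'Ethan', 'Marie'],
    'date': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-01-01',
                            '2018-03-01', '2018-05-01', '2018-08-01',
                            '2018-01-01', '2018-06-01', '2019-01-01']),
    'important_words': [{'cat', 'dog', 'puppy'}, {'cat', 'dog'}, {'flower', 'sun'},
                        {'water', 'blue'}, {'water', 'blue', 'ocean'},
                        {'island', 'sand', 'towel'}, {'hot', 'cold'}, {'hot', 'warm'},
                        {'boo'}],
})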
Upvotes: 2
Views: 65
Reputation: 765
# strip braces/whitespace, then split each row into a list of words
col1 = df1.important_words.str.replace(r'{|}|\s', '', regex=True).str.split(',')
# per text_id, gather the words of all previous edits (empty list for the first edit)
col2 = df1.assign(important_words=col1).groupby('text_id').important_words.transform(lambda ss: ss.cumsum().shift(fill_value=[[]]))
# count the words of the current edit that were not present before
col3 = df1.assign(col1=col1, col2=col2).apply(lambda ss: len(set(ss.col1).difference(set(ss.col2))), axis=1)
df1.assign(added_words=col3)
text_id  user   date        important_words        added_words
1        John   2018-01-01  {cat, dog, puppy}      3
1        John   2018-02-01  {cat, dog}             0
2        Anne   2018-01-01  {flower, sun}          2
3        John   2018-03-01  {water, blue}          2
3        Marie  2018-05-01  {water, blue, ocean}   1
3        Kate   2018-08-01  {island, sand, towel}  3
4        Max    2018-01-01  {hot, cold}            2
4        Ethan  2018-06-01  {hot, warm}            1
5        Marie  2019-01-01  {boo}                  1
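Two details worth spelling out: recent pandas versions need an explicit regex=True for the pattern in str.replace, and fill_value=[[]] appears to work by leaving an empty list in the first row of each group after the shift, so the first edit is compared against nothing. Note also that this approach treats important_words as its string representation.
Since the question states that important_words already contains Python sets, a set-native variant of the same cumulative idea is possible without any string parsing (a sketch under that assumption, using the df built in the question; added_per_text is a name of my own):
def added_per_text(words):
    # words: the important_words Series of one text, in edit order
    seen = set()                      # words present in any earlier edit of this text
    counts = []
    for w in words:
        counts.append(len(w - seen))  # words of this edit never seen before
        seen |= w
    return pd.Series(counts, index=words.index)

df['added_words'] = df.groupby('text_id')['important_words'].transform(added_per_text)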
Upvotes: 0
Reputation: 323386
Hard to think about, but interesting :-) I am using get_dummies, then we just keep the first 1 value per column and sum them:
# one indicator column per word, one row per edit
s = df.important_words.map(','.join).str.get_dummies(sep=',')
# keep only each word's first appearance (cumulative count == 1), then count per row
s.mask(s==0).cumsum().eq(1).sum(axis=1)
Out[247]:
0 3
1 0
2 2
3 2
4 1
5 3
6 2
7 1
8 1
dtype: int64
df['val'] = s.mask(s==0).cumsum().eq(1).sum(axis=1)
Update: the above counts each word's first appearance across the whole frame; grouping by text_id restricts the count to each text:
s = df.important_words.map(','.join).str.get_dummies(sep=',')
# cumulative count within each text_id, so first appearances are per text
s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(axis=1)
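As with the first version, the result can be assigned back to the frame:
df['added_words'] = s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(axis=1)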
Upvotes: 2