HRDSL

Reputation: 761

Column that depends on computing the difference between two column cells in a groupby object

I need some tips to make a calculation.

I have a DataFrame that looks like the following:

text_id      user     date        important_words
1            John    2018-01-01   {cat, dog, puppy}           
1            John    2018-02-01   {cat, dog}
2            Anne    2018-01-01   {flower, sun}
3            John    2018-03-01   {water, blue}
3            Marie   2018-05-01   {water, blue, ocean}
3            Kate    2018-08-01   {island, sand, towel}
4            Max     2018-01-01   {hot, cold}
4            Ethan   2018-06-01   {hot, warm}
5            Marie   2019-01-01   {boo}

In the given dataframe:

- text_id is the id of a text: each text with a different id is a different text.
- user is the name of the user who edited the text (adding and removing important words).
- date is the moment the edit was made (note that the edits of each text are sorted chronologically).
- important_words is the set of important words present in the text after that user's edit.
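
For reference, the sample frame can be rebuilt with something like this (a sketch, assuming the important_words column really holds Python set objects):

import pandas as pd

df = pd.DataFrame({
    'text_id': [1, 1, 2, 3, 3, 3, 4, 4, 5],
    'user': ['John', 'John', 'Anne', 'John', 'Marie', 'Kate', 'Max', 'Ethan', 'Marie'],
    'date': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-01-01', '2018-03-01',
                            '2018-05-01', '2018-08-01', '2018-01-01', '2018-06-01',
                            '2019-01-01']),
    'important_words': [{'cat', 'dog', 'puppy'}, {'cat', 'dog'}, {'flower', 'sun'},
                        {'water', 'blue'}, {'water', 'blue', 'ocean'},
                        {'island', 'sand', 'towel'}, {'hot', 'cold'}, {'hot', 'warm'},
                        {'boo'}],
})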

I need to calculate how many words were added by each user in each edit of a text.

The expected output here would be:

text_id      user     date        important_words        added_words
1            John    2018-01-01   {cat, dog, puppy}      3        
1            John    2018-02-01   {cat, dog}             0
2            Anne    2018-01-01   {flower, sun}          2
3            John    2018-03-01   {water, blue}          2
3            Marie   2018-05-01   {water, blue, ocean}   1
3            Kate    2018-08-01   {island, sand, towel}  3
4            Max     2018-01-01   {hot, cold}            2
4            Ethan   2018-06-01   {hot, warm}            1
5            Marie   2019-01-01   {boo}                  1

Note that the first edit of a text is its creation, so in that case the number of added words is always the size of the important_words set.

Any tips on what would be the fastest way to compute the added_words column will be highly appreciated.

Note that the important_words column contains a set, so computing the difference between two consecutive edits should be easy.
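
The only approach I can come up with is a plain per-group set difference along these lines (a rough sketch, reusing the df built above and a helper I made up called count_added); I would like to know whether there is a faster or more idiomatic way:

def count_added(words):
    # words: the important_words Series of one text, already sorted by date
    prev = words.shift()  # the previous edit's set (NaN for the first edit of a text)
    added = [len(cur - p) if isinstance(p, set) else len(cur)
             for cur, p in zip(words, prev)]
    return pd.Series(added, index=words.index)

df = df.sort_values(['text_id', 'date'])
df['added_words'] = (df.groupby('text_id')['important_words']
                       .transform(count_added))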

Upvotes: 2

Views: 65

Answers (2)

G.G

Reputation: 765

# assumes important_words is stored as a string like "{cat, dog, puppy}"
col1 = df1.important_words.str.replace(r'\{|\}|\s', '', regex=True).str.split(',')

# per text_id: the words seen in all previous edits ([[]] makes the fill an empty list)
col2 = df1.assign(important_words=col1).groupby('text_id').important_words.transform(lambda ss: ss.cumsum().shift(fill_value=[[]]))

# count the words of the current edit that were not present before
col3 = df1.assign(col1=col1, col2=col2).apply(lambda ss: len(set(ss.col1).difference(set(ss.col2))), axis=1)

df1.assign(added_words=col3)


text_id      user     date        important_words        added_words
1            John    2018-01-01   {cat, dog, puppy}      3        
1            John    2018-02-01   {cat, dog}             0
2            Anne    2018-01-01   {flower, sun}          2
3            John    2018-03-01   {water, blue}          2
3            Marie   2018-05-01   {water, blue, ocean}   1
3            Kate    2018-08-01   {island, sand, towel}  3
4            Max     2018-01-01   {hot, cold}            2
4            Ethan   2018-06-01   {hot, warm}            1
5            Marie   2019-01-01   {boo}                  1

Upvotes: 0

BENY

Reputation: 323386

Hard to think about, but interesting :-) I am using get_dummies, then we just keep the first 1 value per column and sum them.

# join each set into one comma separated string, then one-hot encode every word
s=df.important_words.map(','.join).str.get_dummies(sep=',')
# keep only the first occurrence of each word down the rows, then count per row
s.mask(s==0).cumsum().eq(1).sum(1)
Out[247]: 
0    3
1    0
2    2
3    2
4    1
5    3
6    2
7    1
8    1
dtype: int64
df['val']=s.mask(s==0).cumsum().eq(1).sum(1)

Update

Group by text_id before the cumulative sum so the count restarts for every text (otherwise a repeated word is only counted the first time it appears anywhere in the frame):

s=df.important_words.map(','.join).str.get_dummies(sep=',')
s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(1)
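
The grouped result can be assigned back under the column name the question asks for (added_words is the hypothetical target column here, mirroring df['val'] above):

df['added_words']=s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(1)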

Upvotes: 2
