Peter
Peter

Reputation: 96

Rolling unique value count in pandas

I have a DataFrame containing certain words with a unique identifier time that is ascending.

For example:

words = ["a", "b", "a", "c", "d", "a", "b"]
df = pandas.DataFrame(words, columns=["word"])
df.index.name = "time"
     word
time     
0       a
1       b
2       a
3       c
4       d
5       a
6       b

I would like to count the number of unique values in the word column over time, to get a table like this:

     word  rolling_unique_word_count
time                                
0       a                          1
1       b                          2
2       a                          2
3       c                          3
4       d                          4
5       a                          4
6       b                          4

Are there any build-in functions to perform this task?

Upvotes: 2

Views: 898

Answers (3)

Leonardus Chen
Leonardus Chen

Reputation: 1234

One liner using .duplicated

df['new'] = (~df['word'].duplicated()).cumsum()

Output

        word  new
time          
0       a    1
1       b    2
2       a    2
3       c    3
4       d    4
5       a    4
6       b    4

Upvotes: 2

ombk
ombk

Reputation: 2111

def sum_unique(x):
    already_in = []
    list_acc = []
    counter = 0
    for i in x:
        if i in already_in:
            list_acc.append(counter)
        else:
            counter+=1
            list_acc.append(counter)
        already_in.append(i)

    return list_acc

df["unique"] = sum_unique(df.word)

#output

    word unique
t       
0    a     1
1    b     2
2    a     2
3    c     3
4    d     4
5    a     4
6    b     4

I know it isn't the most pretty or optimal way. But it does the job.

It might not be fast at all, but it works as needed.

Upvotes: 1

jezrael
jezrael

Reputation: 862406

Here Series.expanding working with numeric only, so column was factorized first and then is used custom function for count number of unique values:

s = pd.Series(pd.factorize(df['word'])[0], index=df.index)
df['new'] = s.expanding().apply(lambda x: len(np.unique(x)))
#alternative
#df['new'] = s.expanding().apply(lambda x: len(set(x)))
print (df)
     word  new
time          
0       a  1.0
1       b  2.0
2       a  2.0
3       c  3.0
4       d  4.0
5       a  4.0
6       b  4.0

Upvotes: 3

Related Questions