Reputation: 96
I have a DataFrame containing certain words with a unique identifier time
that is ascending.
For example:
words = ["a", "b", "a", "c", "d", "a", "b"]
df = pandas.DataFrame(words, columns=["word"])
df.index.name = "time"
word
time
0 a
1 b
2 a
3 c
4 d
5 a
6 b
I would like to count the number of unique values in the word
column over time
, to get a table like this:
word rolling_unique_word_count
time
0 a 1
1 b 2
2 a 2
3 c 3
4 d 4
5 a 4
6 b 4
Are there any build-in functions to perform this task?
Upvotes: 2
Views: 898
Reputation: 1234
One liner using .duplicated
df['new'] = (~df['word'].duplicated()).cumsum()
Output
word new
time
0 a 1
1 b 2
2 a 2
3 c 3
4 d 4
5 a 4
6 b 4
Upvotes: 2
Reputation: 2111
def sum_unique(x):
already_in = []
list_acc = []
counter = 0
for i in x:
if i in already_in:
list_acc.append(counter)
else:
counter+=1
list_acc.append(counter)
already_in.append(i)
return list_acc
df["unique"] = sum_unique(df.word)
#output
word unique
t
0 a 1
1 b 2
2 a 2
3 c 3
4 d 4
5 a 4
6 b 4
I know it isn't the most pretty or optimal way. But it does the job.
It might not be fast at all, but it works as needed.
Upvotes: 1
Reputation: 862406
Here Series.expanding
working with numeric only, so column was factorized first and then is used custom function for count number of unique values:
s = pd.Series(pd.factorize(df['word'])[0], index=df.index)
df['new'] = s.expanding().apply(lambda x: len(np.unique(x)))
#alternative
#df['new'] = s.expanding().apply(lambda x: len(set(x)))
print (df)
word new
time
0 a 1.0
1 b 2.0
2 a 2.0
3 c 3.0
4 d 4.0
5 a 4.0
6 b 4.0
Upvotes: 3