Reputation: 3355
I want to combine rows in a way where I concatenate values for a specific column but get some unexpected result on my own dataset. Here is an example.
df = pd.DataFrame({'id':['1', '2', '3', '1', '3', '4', '4', '6', '6'],
'words':['a', 'b', 'c', 'b', 'a', 'a', 'b', 'c', 'a' ]})
df2 = df.groupby('id')['words'].apply(' '.join).reset_index()
df2.head()
The result looks like this which is what I want and is fine.
id words
0 1 a b
1 2 b
2 3 c a
3 4 a b
4 6 c a
Unique values based on words column and again it looks fine:
df2.words.value_counts()
c a 2
a b 2
b 1
Name: words, dtype: int64
However, in my own dataset which is big (can't really reproduce it here), the output of df2.words.value_counts()
produces something like this and I can't figure it out why. Any idea what could be going wrong here?
df2.words.value_counts()
c a 10
a c 5
a b 10
b a 5
b 1
Name: words, dtype: int64
But it should be as follows:
df2.words.value_counts()
c a 15
a b 10
b 1
Name: words, dtype: int64
The values here are fake but I get the same values for 'words' column.
Any ideas?
Upvotes: 1
Views: 26
Reputation: 862691
In my opinion simpliest is sorting values in join
function, so value_counts
working correct:
df2 = df.groupby('id')['words'].apply(lambda x: ' '.join(sorted(x))).reset_index()
print (df2)
id words
0 1 a b
1 2 b
2 3 a c
3 4 a b
4 6 a c
print (df2.words.value_counts())
a b 2
a c 2
b 1
Name: words, dtype: int64
Upvotes: 1