utengr
utengr

Reputation: 3355

Combining duplicate dataframe rows with concatenating values for a specific column

I want to combine rows in a way where I concatenate values for a specific column but get some unexpected result on my own dataset. Here is an example.

df = pd.DataFrame({'id':['1', '2', '3', '1', '3', '4', '4', '6', '6'],
                'words':['a', 'b', 'c', 'b', 'a', 'a', 'b', 'c', 'a' ]})
df2 = df.groupby('id')['words'].apply(' '.join).reset_index()

df2.head()

The result looks like this which is what I want and is fine.

    id  words
0   1   a b
1   2   b
2   3   c a
3   4   a b
4   6   c a

Unique values based on words column and again it looks fine:

df2.words.value_counts()
c a    2
a b    2
b      1
Name: words, dtype: int64

However, in my own dataset which is big (can't really reproduce it here), the output of df2.words.value_counts() produces something like this and I can't figure it out why. Any idea what could be going wrong here?

df2.words.value_counts()
c a    10
a c    5
a b    10
b a    5
b      1
Name: words, dtype: int64

But it should be as follows:

df2.words.value_counts()
c a    15
a b    10
b      1
Name: words, dtype: int64

The values here are fake but I get the same values for 'words' column.

Any ideas?

Upvotes: 1

Views: 26

Answers (1)

jezrael
jezrael

Reputation: 862691

In my opinion simpliest is sorting values in join function, so value_counts working correct:

df2 = df.groupby('id')['words'].apply(lambda x: ' '.join(sorted(x))).reset_index()
print (df2)
  id words
0  1   a b
1  2     b
2  3   a c
3  4   a b
4  6   a c

print (df2.words.value_counts())
a b    2
a c    2
b      1
Name: words, dtype: int64

Upvotes: 1

Related Questions