Reputation: 449
I'm trying to find out if there is a way to remove duplicates in my data frame while concatenating the values.
example:
df
key v1 v2
0 1 n/a a
1 2 n/a b
2 3 n/a c
3 2 n/a d
4 3 n/a e
the output should look like:
df_out
key v1 v2
0 1 n/a a
1 2 n/a b,d
2 3 n/a c,e
I tried using df.drop_duplicates() and some loops to save the v2 column values, but nothing has worked yet. I'm trying to do it nice and clean without loops, using pandas.
Does anyone know a way pandas can do this?
Upvotes: 3
Views: 7032
Reputation: 11
I was able to implement the following code, but I have an additional requirement: 'CONCAT' can hold up to 300 joined values, and the rest should go into separate rows. Example: if I have 750 values in 'CONCAT' with the same 'ISSUEID', the first 300 go in row 1, the next 300 in row 2, and the remaining 150 in row 3. Is there a way I can tweak this?
df = (data.groupby('ISSUEID')
.agg({'KICKCODE' : 'first', 'CONCAT' : ','.join})
.reset_index()
.reindex(columns=data.columns))
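One way to get the chunked rows described above is to aggregate each group into a list, split that list into fixed-size chunks, and explode the chunks into separate rows. This is a sketch with made-up sample data and a chunk size of 3 standing in for 300 (explode requires pandas 0.25+):

```python
import pandas as pd

# Hypothetical sample: 7 CONCAT values sharing one ISSUEID.
data = pd.DataFrame({
    'ISSUEID': [1] * 7,
    'KICKCODE': ['k'] * 7,
    'CONCAT': list('abcdefg'),
})

N = 3  # chunk size; use 300 for the real requirement

# Collect each group's values into a list first.
out = (data.groupby('ISSUEID')
           .agg({'KICKCODE': 'first', 'CONCAT': list})
           .reset_index())

# Split each list into chunks of N and join each chunk, then
# explode so every chunk becomes its own row.
out['CONCAT'] = out['CONCAT'].apply(
    lambda v: [','.join(v[i:i + N]) for i in range(0, len(v), N)])
out = out.explode('CONCAT').reset_index(drop=True)

print(out)
```

With 7 values and N=3 this yields three rows for ISSUEID 1: 'a,b,c', 'd,e,f', and 'g', each carrying the group's first KICKCODE.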
Upvotes: 1
Reputation: 402814
This should be easy, assuming you have two columns. Use groupby + agg: v1 should be aggregated by 'first', and v2 by ','.join.
df
key v1 v2
0 1 NaN a
1 2 NaN b
2 3 NaN c
3 2 NaN d
4 3 NaN e
(df.groupby('key')
.agg({'v1' : 'first', 'v2' : ','.join})
.reset_index()
.reindex(columns=df.columns))
key v1 v2
0 1 NaN a
1 2 NaN b,d
2 3 NaN c,e
If you have multiple such columns requiring the same aggregation, build an agg dict called f and pass it to agg.
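A minimal sketch of that pattern, using hypothetical extra columns v2 and v3 that both need joining:

```python
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 2],
    'v1': ['n/a'] * 3,
    'v2': ['a', 'b', 'd'],
    'v3': ['x', 'y', 'z'],
})

# Build the agg dict: 'first' for v1, ','.join for every join column.
f = {'v1': 'first'}
f.update({c: ','.join for c in ['v2', 'v3']})

out = df.groupby('key').agg(f).reset_index()
print(out)
```

For key 2 this produces v2 = 'b,d' and v3 = 'y,z', so any number of join columns can be handled by one dict instead of writing out each aggregation by hand.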
Upvotes: 7
Reputation: 7994
Use apply (pandas.core.groupby.GroupBy.apply):
GroupBy.apply(func, *args, **kwargs)
Apply function func group-wise and combine the results together.
df.groupby(["key", "v1"])["v2"].apply(list) # or apply(set) depending on your needs
Output:
key v1
1 n/a [a]
2 n/a [b, d]
3 n/a [c, e]
Name: v2, dtype: object
Upvotes: 0
Reputation: 323316
Using set to drop duplicate values before joining:
df.groupby('key').agg(lambda x : ','.join(set(x)))
Out[1255]:
v1 v2
key
1 n/a a
2 n/a b,d
3 n/a c,e
Upvotes: 3