Vince
Vince

Reputation: 557

get unique column value by date in python

i've generated this dataframe:

np.random.seed(123)

len_df = 10
groups_list = ['A','B']
dates_list = pd.date_range(start='1/1/2020', periods=10, freq='D').to_list()

df2 = pd.DataFrame()
df2['date'] = np.random.choice(dates_list, size=len_df)
df2['value'] = np.random.randint(232, 1532, size=len_df)
df2['group'] = np.random.choice(groups_list, size=len_df)
df2 = df2.sort_values(by=['date'])
df2.reset_index(drop=True, inplace=True)

        date group  value
0 2020-01-01     A    652
1 2020-01-02     B   1174
2 2020-01-02     B   1509
3 2020-01-02     A    840
4 2020-01-03     A    870
5 2020-01-03     A    279
6 2020-01-04     B    456
7 2020-01-07     B    305
8 2020-01-07     A   1078
9 2020-01-10     A    343

I need to get rid of duplicated groups in the same date. I just want that one group appears only once in a date.

Result

        date group  value
0 2020-01-01     A    652
1 2020-01-02     B   1174
2 2020-01-02     A    840
3 2020-01-03     A    870
4 2020-01-04     B    456
5 2020-01-07     B    305
6 2020-01-07     A   1078
7 2020-01-10     A    343

Upvotes: 0

Views: 114

Answers (3)

guigomcha
guigomcha

Reputation: 78

you are looking for the drop_duplicates method on a dataframe.

df2 = df2.drop_duplicates(subset=['date', 'group'], keep='first').reset_index(drop=True)
 
       
         date  value group
0 2020-01-01    652     A
1 2020-01-02   1174     B
2 2020-01-02    840     A
3 2020-01-03    870     A
4 2020-01-04    456     B
5 2020-01-07    305     B
6 2020-01-07   1078     A
7 2020-01-10    343     A

Upvotes: 1

Ajay
Ajay

Reputation: 36

You can use drop_duplicates() to drop based on a subset of columns. However, you need to specify which row to keep e.g. first/last row.

df2 = df2.drop_duplicates(subset=['date', 'group'], keep='first')

Upvotes: 1

Andreas
Andreas

Reputation: 9197

.drop_duplicates() is in the pandas library and allows you to do exactly that. Read more in the documentation.

df2.drop_duplicates(subset=["date", "group"], keep="first")

Out[9]: 
        date group  value
0 2020-01-01     A    652
1 2020-01-02     B   1174
3 2020-01-02     A    840
4 2020-01-03     A    870
6 2020-01-04     B    456
7 2020-01-07     B    305
8 2020-01-07     A   1078
9 2020-01-10     A    343

Upvotes: 2

Related Questions