Reputation: 77
I'm new to pandas and would like to check if there is a more efficient way to do what I'm trying to achieve.
I have a customer master data change log covering 12 months that looks like the sample below (the full file has 150K rows and 15 columns):
date1 | changed_by | customer | field_name |
---|---|---|---|
23/02/2021 18:59 | 101852 | 40001375 | Industry |
23/02/2021 18:59 | 101852 | 40001375 | Industry code 1 |
23/02/2021 18:59 | 101852 | 40001375 | Region |
30/12/2021 02:31 | 119186 | 30045194 | Postal Code |
30/12/2021 02:31 | 119186 | 30045194 | Industry code 1 |
30/12/2021 02:31 | 119186 | 30045194 | Region |
30/01/2021 03:31 | 109209 | 40001375 | Postal Code |
30/01/2021 03:31 | 109209 | 40001375 | Industry code 1 |
30/01/2021 03:31 | 109209 | 40001375 | Bank |
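For reference, here is a minimal sketch that rebuilds this sample as a DataFrame (added for reproducibility; df3 matches the name used in my code below, and the timestamps are assumed to be day-first, dd/mm/yyyy):

import pandas as pd

# Minimal reproduction of the sample log; the real file has 150K rows and 15 columns.
df3 = pd.DataFrame({
    'date1': ['23/02/2021 18:59'] * 3 + ['30/12/2021 02:31'] * 3 + ['30/01/2021 03:31'] * 3,
    'changed_by': [101852] * 3 + [119186] * 3 + [109209] * 3,
    'customer': [40001375] * 3 + [30045194] * 3 + [40001375] * 3,
    'field_name': ['Industry', 'Industry code 1', 'Region',
                   'Postal Code', 'Industry code 1', 'Region',
                   'Postal Code', 'Industry code 1', 'Bank'],
})
# Timestamps are dd/mm/yyyy, so parse them day-first.
df3['date1'] = pd.to_datetime(df3['date1'], dayfirst=True)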
I would like to check whether any customer's data was changed more than once within the period, i.e. the same customer ID appearing under different dates, and then extract those lines for further analysis.
Below is the code I currently have, and it seems to work OK:
pvz = df3.groupby(['date1','customer'])['field_name'].count().reset_index()
pvz
This gives me:
date1 | customer | field_name |
---|---|---|
23/02/2021 18:59 | 40001375 | 3 |
30/01/2021 03:31 | 40001375 | 3 |
30/12/2021 02:31 | 30045194 | 3 |
Then I keep only the customers that appear under more than one timestamp:
pvz[pvz.duplicated(subset='customer',keep=False)]
This returns the following, which seems OK:
date1 | customer | field_name |
---|---|---|
23/02/2021 18:59 | 40001375 | 3 |
30/01/2021 03:31 | 40001375 | 3 |
As a beginner I would like to check whether there is a more efficient way to do this.
Thank you.
Upvotes: 0
Views: 184
Reputation: 120409
One optimization you can do is to replace the groupby + count with value_counts:
>>> df.value_counts(['date1', 'customer']).rename('count').reset_index() \
.loc[lambda x: x.duplicated('customer', keep=False)]
date1 customer count
0 23/02/2021 18:59 40001375 3
1 30/01/2021 03:31 40001375 3
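value_counts counts whole rows directly (like groupby(...).size()) instead of counting non-null values in one column, which can be a little faster; note that it also sorts by count, so the row order may differ from your groupby version. If you want to measure the difference on data of a similar size, here is a rough timing sketch on synthetic values (hypothetical columns; the %timeit magic assumes IPython/Jupyter):

import numpy as np
import pandas as pd

# Synthetic stand-in for the real 150K-row log (hypothetical values).
rng = np.random.default_rng(0)
n = 150_000
df = pd.DataFrame({
    'date1': rng.choice(pd.date_range('2021-01-01', periods=365), n),
    'customer': rng.integers(30_000_000, 40_002_000, n),
    'field_name': rng.choice(['Industry', 'Region', 'Postal Code', 'Bank'], n),
})

%timeit df.groupby(['date1', 'customer'])['field_name'].count().reset_index()
%timeit df.value_counts(['date1', 'customer']).rename('count').reset_index()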
Upvotes: 0
Reputation:
You can reduce it to one line by passing as_index=False in your groupby call instead of using reset_index(), and chaining the filter with .loc:
filtered = df.groupby(['date1', 'customer'], as_index=False)['field_name'].count() \
             .loc[lambda x: x.duplicated(subset='customer', keep=False)]
Output:
>>> filtered
date1 customer field_name
0 23/02/2021 18:59 40001375 3
1 30/01/2021 03:31 40001375 3
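Since the question mentions extracting those lines for further analysis, a possible follow-up (a sketch, assuming df is the full change log) is to pull the original rows for the flagged customers:

# Original log rows for customers whose data changed on more than one date.
changed_more_than_once = df[df['customer'].isin(filtered['customer'])]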
Upvotes: 1