Reputation: 57
I'm new to python, and I wanted to sort some duplication data according to some column of data within a data frame set, for example
Import pandas as pd
df = pd.read_excel('Data.xlsx', index = ['ID']
df2 = df[df.duplicated(subset = ['A','B'], keep = False)]
print (df2)
Let's say the output will be like this
'ID'|'Name' |'A'|'B'|
1 | Ash | 1 | 1 |
2 | James | 1 | 1 |
3 | Ash | 1 | 1 |
4 | James | 1 | 1 |
5 | Ash | 2 | 1 |
6 | James | 1 | 1 |
7 | Ash | 2 | 1 |
I would like to have the output of data as below:
'Name' |'A'|'B'|'Pattern'|'Frequency of Pattern'|
Ash | 1 | 1 | 1 | 2 |
Ash | 2 | 1 | 2 | 2 |
James | 1 | 1 | 3 | 3 |
So far I haven't found any similar post yet
Upvotes: 2
Views: 31
Reputation: 862681
Use GroupBy.size
for count duplicates and then add new column to specific position by DataFrame.insert
:
df4 = df3.groupby(['Name','A','B']).size().reset_index(name='Frequency of Pattern')
df4.insert(3, 'Pattern', df4.index + 1)
print (df4)
Name A B Pattern Frequency of Pattern
0 Ash 1 1 1 2
1 Ash 2 1 2 2
2 James 1 1 3 3
Upvotes: 1