Reputation: 4292
I have a large Pandas dataframe (> 1 million rows) that I have retrieved from a SQL Server database. In a small number of cases, some of the records have duplicate entries. All cells are identical except for a single, text field. It looks as though the record has been entered into the database and then, at a later time, additional text has been added to the field and the record stored in the database as a separate entry. So basically, I want to keep only the record with the longest text string. A simplified version of the database can be created as follows:
tempDF = pd.DataFrame({'recordID': [21,22,23,23,24,25,26,26,26,27,27,28,29,30],
                       'text': ['abc', 'def', 'ghi', 'ghijkl', 'mno', 'pqr', 'st', 'stuvw', 'stuvwx', 'yz', 'yzab', 'cde', 'fgh', 'ijk']})
Which looks like this:
recordID text
0 21 abc
1 22 def
2 23 ghi
3 23 ghijkl
4 24 mno
5 25 pqr
6 26 st
7 26 stuvw
8 26 stuvwx
9 27 yz
10 27 yzab
11 28 cde
12 29 fgh
13 30 ijk
So far, I've identified the rows with duplicate recordID and calculated the length of the text field:
tempDF['dupl'] = tempDF.duplicated(subset = 'recordID',keep=False)
tempDF['texLen'] = tempDF['text'].str.len()
print(tempDF)
To produce:
recordID text dupl texLen
0 21 abc False 3
1 22 def False 3
2 23 ghi True 3
3 23 ghijkl True 6
4 24 mno False 3
5 25 pqr False 3
6 26 st True 2
7 26 stuvw True 5
8 26 stuvwx True 6
9 27 yz True 2
10 27 yzab True 4
11 28 cde False 3
12 29 fgh False 3
13 30 ijk False 3
I can groupby all the dupl==True records based on recordID using:
tempGrouped = tempDF[tempDF['dupl']==True].groupby('recordID')
And print off each group separately:
for name, group in tempGrouped:
    print('\n', name)
    print(group)
23
recordID text dupl texLen
2 23 ghi True 3
3 23 ghijkl True 6
26
recordID text dupl texLen
6 26 st True 2
7 26 stuvw True 5
8 26 stuvwx True 6
27
recordID text dupl texLen
9 27 yz True 2
10 27 yzab True 4
I want the final dataframe to consist of those records where dupl==False and, if dupl==True then only the replicate with the longest text field should be retained. So, the final dataframe should look like:
recordID text dupl texLen
0 21 abc False 3
1 22 def False 3
3 23 ghijkl True 6
4 24 mno False 3
5 25 pqr False 3
8 26 stuvwx True 6
10 27 yzab True 4
11 28 cde False 3
12 29 fgh False 3
13 30 ijk False 3
How can I delete from the original dataframe only those rows where recordID is duplicated and where texLen is less than the maximum?
Upvotes: 1
Views: 3688
Reputation: 862581
You can find the indexes of the rows with the maximum texLen in each group using idxmax, concat them with the dupl==False rows, and finally restore the original order with sort_index:
idx = tempDF[tempDF['dupl']==True].groupby('recordID')['texLen'].idxmax()
print(tempDF.loc[idx])
recordID text dupl texLen
3 23 ghijkl True 6
8 26 stuvwx True 6
10 27 yzab True 4
print(pd.concat([tempDF[tempDF['dupl']==False], tempDF.loc[idx]]).sort_index())
recordID text dupl texLen
0 21 abc False 3
1 22 def False 3
3 23 ghijkl True 6
4 24 mno False 3
5 25 pqr False 3
8 26 stuvwx True 6
10 27 yzab True 4
11 28 cde False 3
12 29 fgh False 3
13 30 ijk False 3
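As a further sketch (not part of the original answer), the "delete the shorter duplicates" phrasing in the question can also be satisfied with a boolean mask built from groupby().transform('max'), which avoids the concat step and keeps the original index:

```python
import pandas as pd

# Rebuild the example frame from the question
tempDF = pd.DataFrame({
    'recordID': [21, 22, 23, 23, 24, 25, 26, 26, 26, 27, 27, 28, 29, 30],
    'text': ['abc', 'def', 'ghi', 'ghijkl', 'mno', 'pqr', 'st',
             'stuvw', 'stuvwx', 'yz', 'yzab', 'cde', 'fgh', 'ijk']})
tempDF['dupl'] = tempDF.duplicated(subset='recordID', keep=False)
tempDF['texLen'] = tempDF['text'].str.len()

# Keep each row whose texLen equals the per-recordID maximum; rows that
# are not duplicated trivially equal their own maximum, so they survive too
mask = tempDF['texLen'] == tempDF.groupby('recordID')['texLen'].transform('max')
result = tempDF[mask]
print(result)
```

Note one caveat: if two duplicates of the same recordID tie on length, this mask keeps both of them, whereas the idxmax approach keeps only one.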
A simpler solution uses sort_values and first, because the rows with dupl==False have unique recordID values (they are NOT duplicated):
df = tempDF.sort_values(by="texLen", ascending=False).groupby("recordID").first().reset_index()
print(df)
recordID text dupl texLen
0 21 abc False 3
1 22 def False 3
2 23 ghijkl True 6
3 24 mno False 3
4 25 pqr False 3
5 26 stuvwx True 6
6 27 yzab True 4
7 28 cde False 3
8 29 fgh False 3
9 30 ijk False 3
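A closely related variant (my addition, not from the original answer) swaps first() for drop_duplicates, which keeps whole rows and, unlike first().reset_index(), preserves the original index once you sort it back:

```python
import pandas as pd

# Rebuild the example frame from the question
tempDF = pd.DataFrame({
    'recordID': [21, 22, 23, 23, 24, 25, 26, 26, 26, 27, 27, 28, 29, 30],
    'text': ['abc', 'def', 'ghi', 'ghijkl', 'mno', 'pqr', 'st',
             'stuvw', 'stuvwx', 'yz', 'yzab', 'cde', 'fgh', 'ijk']})
tempDF['dupl'] = tempDF.duplicated(subset='recordID', keep=False)
tempDF['texLen'] = tempDF['text'].str.len()

# Sort so the longest text per recordID comes first, keep only the first
# occurrence of each recordID, then restore the original row order
result = (tempDF.sort_values(by='texLen', ascending=False)
                .drop_duplicates(subset='recordID')
                .sort_index())
print(result)
```

With the example data this yields the exact final frame shown in the question, index 0, 1, 3, 4, 5, 8, 10, 11, 12, 13.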
Upvotes: 1