Reputation: 527
I have a DF with the layout of below. The actual table is in the range of +10m rows so on the larger side.
df = pd.DataFrame({'id1' : ['12a', '13b', '14c', '15d', '16e', '18g', '17f', '19h']
,'id2' : ['16e', '17f', '18g', '19h','12a', '14c','13b', '15d']
,'var1' : [i for i in range(8)]
,'var2' : list(np.random.randint(100, size = 8))
,'var3' : [1, 2, np.nan, 3, 2, np.nan, np.nan, 34]
})
>>> df
id1 id2 var1 var2 var3
0 12a 16e 0 66 1.0
1 13b 17f 1 9 2.0
2 14c 18g 2 48 NaN
3 15d 19h 3 13 3.0
4 16e 12a 4 67 2.0
5 18g 14c 5 88 NaN
6 17f 13b 6 92 NaN
7 19h 15d 7 99 34.0
What I want is to reduce the IDs in the DF, which goes in pairs, e.g. id1 index row 0 with id2 index row 4. All ids exist on two rows and I need to check every row and remove one of the rows. Currently I have a solution-ish which is row itteration and a bit slow.
Please note that one cant just drop the lower part of the DF (index row 4 and onwards) since all IDs need to be checked (id1 vs id2).
Final table will look like this:
id1 id2 var1 var2 12a 16e 66 1.0 13b 17f 9 2.0 14c 18g 48 NaN 15d 19h 13 3.0
All "speedy" solutions are highly appriciated.
Upvotes: 1
Views: 59
Reputation: 863291
I believe is possible sort columns per rows and filter only first rows by DataFrame.duplicated
with boolean indexing
and inverting mask by ~
:
np.random.seed(2018)
df = pd.DataFrame({'id1' : ['12a', '13b', '14c', '15d', '16e', '18g', '17f', '19h']
,'id2' : ['16e', '17f', '18g', '19h','12a', '14c','13b', '15d']
,'var1' : [i for i in range(8)]
,'var2' : list(np.random.randint(100, size = 8))
,'var3' : [1, 2, np.nan, 3, 2, np.nan, np.nan, 34]
})
df = df[~pd.DataFrame(np.sort(df[['id1', 'id2']], 1)).duplicated()]
print (df)
id1 id2 var1 var2 var3
0 12a 16e 0 62 1.0
1 13b 17f 1 59 2.0
2 14c 18g 2 58 NaN
3 15d 19h 3 72 3.0
Details:
print (pd.DataFrame(np.sort(df[['id1', 'id2']], 1)))
0 1
0 12a 16e
1 13b 17f
2 14c 18g
3 15d 19h
4 12a 16e
5 14c 18g
6 13b 17f
7 15d 19h
print (~pd.DataFrame(np.sort(df[['id1', 'id2']], 1)).duplicated())
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
dtype: bool
Upvotes: 1