Reputation: 309
I have two dataframes that share several common columns as follows :
the first one df1
:
TG_ID ORG_NO TG_CAP DATA_DATE METER_ID CONS_SORT WIRING_MODE T_FACTOR PHASE_FLAG I1 ... I87 I88 I89 I90 I91 I92 I93 I94 I95 I96
0 1646176 134010902 315.0 20200801 8200000036483255 06 3 60.0 1 0.606333 ... 0.480667 0.524333 0.399333 0.430667 0.386667 0.396667 0.371333 0.314333 0.336333 0.289000
1 1646176 134010902 315.0 20200801 8200000036483255 06 3 60.0 2 0.606333 ... 0.480667 0.524333 0.399333 0.430667 0.386667 0.396667 0.371333 0.314333 0.336333 0.289000
2 1646176 134010902 315.0 20200801 8200000036483255 06 3 60.0 3 0.606333 ... 0.480667 0.524333 0.399333 0.430667 0.386667 0.396667 0.371333 0.314333 0.336333 0.289000
6 1646178 134010902 200.0 20200801 8200000033404961 06 3 60.0 1 0.338333 ... 0.199333 0.215000 0.191667 0.248333 0.154000 0.240000 0.129333 0.224667 0.152000 0.198000
7 1646178 134010902 200.0 20200801 8200000033404961 06 3 60.0 2 0.338333 ... 0.199333 0.215000 0.191667 0.248333 0.154000 0.240000 0.129333 0.224667 0.152000 0.198000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1117 900028797849 134010902 400.0 20200801 8200000035893592 06 3 120.0 2 0.286667 ... 0.192000 0.196667 0.203667 0.171000 0.140333 0.154333 0.126000 0.144000 0.144667 0.131333
1118 900028797849 134010902 400.0 20200801 8200000035893592 06 3 120.0 3 0.286667 ... 0.192000 0.196667 0.203667 0.171000 0.140333 0.154333 0.126000 0.144000 0.144667 0.131333
1119 900029405841 134010902 100.0 20200801 8200000036482415 06 3 30.0 1 0.060000 ... 0.088000 0.127667 0.084333 0.084333 0.123333 0.088000 0.084000 0.084333 0.084333 0.083667
1120 900029405841 134010902 100.0 20200801 8200000036482415 06 3 30.0 2 0.060000 ... 0.088000 0.127667 0.084333 0.084333 0.123333 0.088000 0.084000 0.084333 0.084333 0.083667
1121 900029405841 134010902 100.0 20200801 8200000036482415 06 3 30.0 3 0.060000 ... 0.088000 0.127667 0.084333 0.084333 0.123333 0.088000 0.084000 0.084333 0.084333 0.083667
and the second one df2
:
TG_ID ORG_NO DATA_DATE METER_ID CONS_SORT WIRING_MODE PHASE_FLAG U1 U2 U3 ... U87 U88 U89 U90 U91 U92 U93 U94 U95 U96
27 1646176 134010901 20200801 8200000034727355 04 1 1 248.7 247.4 248.8 ... 251.0 251.4 251.5 246.6 248.1 247.8 248.0 247.8 248.0 246.8
28 1646176 134010902 20200801 8200000032252506 05 1 1 248.5 247.8 248.9 ... 251.0 251.5 249.3 246.8 248.2 247.6 247.2 247.9 249.2 246.9
30 1646176 134010902 20200801 8200000031675334 05 1 1 248.5 248.3 250.4 ... 253.0 251.9 248.4 249.1 248.5 247.6 247.6 246.9 245.9 247.5
32 1646176 134010902 20200801 8200000032855108 05 1 1 240.9 241.1 242.8 ... 248.8 248.8 246.1 245.9 246.7 247.0 245.2 246.9 244.4 243.6
34 1646176 134010902 20200801 8200000032861235 05 1 1 241.8 242.7 243.9 ... 249.8 249.5 246.4 246.2 246.9 247.0 245.7 247.6 244.7 244.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28323 900028797849 134010902 20200801 8200000032252390 05 1 1 229.6 231.3 231.2 ... 236.0 234.4 232.5 232.7 231.8 233.3 233.1 232.4 229.5 229.3
28324 900028797849 134010902 20200801 8200000035146454 05 1 1 233.2 232.5 233.1 ... 236.8 237.3 237.5 233.8 234.2 233.4 233.6 232.3 233.3 233.7
28325 900028797849 134010902 20200801 8200000035113338 05 1 1 233.1 232.6 233.0 ... 237.0 237.2 234.7 233.7 234.3 233.4 233.6 232.2 232.7 233.8
28326 900028797849 134010902 20200801 8200000035132418 05 1 1 233.7 233.4 233.8 ... 237.3 237.7 234.9 234.3 234.4 233.9 233.6 232.0 233.2 233.9
28327 900028797849 134010902 20200801 8200000035138508 05 1 1 231.9 233.3 233.4 ... 237.7 236.1 234.3 234.1 233.5 233.8 233.6 233.8 231.3 231.1
I want to keep rows in df2
whose values in columns "TG_ID", "ORG_NO" and "DATA_DATE"
are also present in df1
. For example, 27-th row of df2 has value "1646176", "134010901" and "20200801"
for columns "TG_ID", "ORG_NO" and "DATA_DATE"
, these values are not all present in df1
(since df1 only has value "134010902"
for the column "ORG_NO"
), so 27-th row of df2 should be filtered out. On the other hand, 28-th row of df2
should be kept since value "1646176", "134010902" and "20200801"
are also presented in df1
.
I´ve tried merge like following :
df3 = df2.merge(df1, on = ["TG_ID", "ORG_NO", "DATA_DATE"])
but it gives more rows and columns than df2
, which is not expected.
And the following call to join() simply gives the error
len(left_on) must equal the number of levels in the index of "right"
df3 = df2.join(df1, on = ["TG_ID", "ORG_NO", "DATA_DATE"])
How can i achieve my requirement properly?
Upvotes: 1
Views: 770
Reputation: 71689
You can set the columns TG_ID
, ORG_NO
and DATA_DATE
as a multiindex in both the dataframes df1
and df2
then using MultiIndex.isin
test the membership for index of df2
in the index of dataframe df1
to create a boolean mask
:
c = ["TG_ID", "ORG_NO", "DATA_DATE"]
mask = df2.set_index(c).index.isin(df1.set_index(c).index)
>>> df2[mask]
TG_ID ORG_NO DATA_DATE METER_ID CONS_SORT WIRING_MODE PHASE_FLAG U1 U2 U3 ... U87 U88 U89 U90 U91 U92 U93 U94 U95 U96
28 1646176 134010902 20200801 8200000032252506 05 1 1 248.5 247.8 248.9 ... 251.0 251.5 249.3 246.8 248.2 247.6 247.2 247.9 249.2 246.9
30 1646176 134010902 20200801 8200000031675334 05 1 1 248.5 248.3 250.4 ... 253.0 251.9 248.4 249.1 248.5 247.6 247.6 246.9 245.9 247.5
32 1646176 134010902 20200801 8200000032855108 05 1 1 240.9 241.1 242.8 ... 248.8 248.8 246.1 245.9 246.7 247.0 245.2 246.9 244.4 243.6
34 1646176 134010902 20200801 8200000032861235 05 1 1 241.8 242.7 243.9 ... 249.8 249.5 246.4 246.2 246.9 247.0 245.7 247.6 244.7 244.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28323 900028797849 134010902 20200801 8200000032252390 05 1 1 229.6 231.3 231.2 ... 236.0 234.4 232.5 232.7 231.8 233.3 233.1 232.4 229.5 229.3
28324 900028797849 134010902 20200801 8200000035146454 05 1 1 233.2 232.5 233.1 ... 236.8 237.3 237.5 233.8 234.2 233.4 233.6 232.3 233.3 233.7
28325 900028797849 134010902 20200801 8200000035113338 05 1 1 233.1 232.6 233.0 ... 237.0 237.2 234.7 233.7 234.3 233.4 233.6 232.2 232.7 233.8
28326 900028797849 134010902 20200801 8200000035132418 05 1 1 233.7 233.4 233.8 ... 237.3 237.7 234.9 234.3 234.4 233.9 233.6 232.0 233.2 233.9
28327 900028797849 134010902 20200801 8200000035138508 05 1 1 231.9 233.3 233.4 ... 237.7 236.1 234.3 234.1 233.5 233.8 233.6 233.8 231.3 231.1
Upvotes: 2