Reputation: 603
Let's assume I have a very large pandas DataFrame dfBig
with columns Param1, Param2, ..., ParamN, score, step
, and a smaller DataFrame dfSmall
with columns Param1, Param2, ..., ParamN
(i.e. missing the score
and step
columns).
I want to select all the rows of dfBig
for which the values of columns Param1, Param2, ..., ParamN
match those of some row in dfSmall
. Is there a clean way of doing this in pandas?
Edit: To give an example, consider this DataFrame dfBig
:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
A | 2 | 0.4 | 10
A | 2 | 0.5 | 20
A | 2 | 0.6 | 30
B | 1 | 0.1 | 10
B | 1 | 0.2 | 20
B | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
Let's imagine a model is specified by a pair (Arch, Layers)
. I want to query dfBig
and get the time series for scores over time for the best performing models with Arch A and Arch B.
Following EdChum's answer below, I take it that the best solution is to do something like this procedurally:
modelColumns = [col for col in dfBigCol if col not in ["Time", "Score"]]
groupedBest = dfBig.groupby("Arch").Score.max()
dfSmall = pd.DataFrame(groupedBest).reset_index()[modelColumns]
dfBest = pd.merge(dfSmall, dfBig)
which yields:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
If there's a better way to do this, I'm happy to hear it.
Upvotes: 1
Views: 127
Reputation: 394459
If I understand your question correctly you should be able to just call merge
on dfBig
and pass dfSmall
which will look for matches in the aligned columns and only return those rows.
Example:
In [71]:
dfBig = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100), 'c':np.arange(100)})
dfSmall = pd.DataFrame({'a':[3,4,5,6]})
dfBig.merge(dfSmall)
Out[71]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
Upvotes: 1