Edward Grefenstette
Edward Grefenstette

Reputation: 603

Select rows in one DataFrame based on rows in another

Let's assume I have a very large pandas DataFrame dfBig with columns Param1, Param2, ..., ParamN, score, step, and a smaller DataFrame dfSmall with columns Param1, Param2, ..., ParamN (i.e. missing the score and step columns).

I want to select all the rows of dfBig for which the values of columns Param1, Param2, ..., ParamN match those of some row in dfSmall. Is there a clean way of doing this in pandas?

Edit: To give an example, consider this DataFrame dfBig:

Arch  | Layers | Score | Time
A     | 1      | 0.3   | 10
A     | 1      | 0.6   | 20
A     | 1      | 0.7   | 30
A     | 2      | 0.4   | 10
A     | 2      | 0.5   | 20
A     | 2      | 0.6   | 30
B     | 1      | 0.1   | 10
B     | 1      | 0.2   | 20
B     | 1      | 0.7   | 30
B     | 2      | 0.7   | 10
B     | 2      | 0.8   | 20
B     | 2      | 0.8   | 30

Let's imagine a model is specified by a pair (Arch, Layers). I want to query dfBig and get the time series for scores over time for the best performing models with Arch A and Arch B.

Following EdChum's answer below, I take it that the best solution is to do something like this procedurally:

modelColumns = [col for col in dfBigCol if col not in ["Time", "Score"]]    
groupedBest = dfBig.groupby("Arch").Score.max()
dfSmall = pd.DataFrame(groupedBest).reset_index()[modelColumns]
dfBest = pd.merge(dfSmall, dfBig)

which yields:

Arch  | Layers | Score | Time
A     | 1      | 0.3   | 10
A     | 1      | 0.6   | 20
A     | 1      | 0.7   | 30
B     | 2      | 0.7   | 10
B     | 2      | 0.8   | 20
B     | 2      | 0.8   | 30

If there's a better way to do this, I'm happy to hear it.

Upvotes: 1

Views: 127

Answers (1)

EdChum
EdChum

Reputation: 394459

If I understand your question correctly you should be able to just call merge on dfBig and pass dfSmall which will look for matches in the aligned columns and only return those rows.

Example:

In [71]:

dfBig = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100), 'c':np.arange(100)})
dfSmall = pd.DataFrame({'a':[3,4,5,6]})
dfBig.merge(dfSmall)
Out[71]:
   a  b  c
0  3  3  3
1  4  4  4
2  5  5  5
3  6  6  6

Upvotes: 1

Related Questions