Subsetting DataFrame based on column names of another DataFrame

Question

I have two DataFrames and I want to subset df2 based on the column names that intersect with the column names of df1. In R this is easy.

R code:

df1 <- data.frame(a=rnorm(5), b=rnorm(5))
df2 <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5))

df2[names(df2) %in% names(df1)]
           a          b
1 -0.8173361  0.6450052
2 -0.8046676  0.6441492
3 -0.3545996 -1.6545289
4  1.3364769 -0.4340254
5 -0.6013046  1.6118360

However, I'm not sure how to do this in pandas.

pandas attempt:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,))})
df2 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,)), 'c': np.random.standard_normal((5,))})

df2[df2.columns in df1.columns]

This results in TypeError: unhashable type: 'Index'. What's the right way to do this?

miradulo · Accepted Answer

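If you need a true intersection, note that .columns yields an Index object, which supports set operations, so you can use Index.intersection:

df2[df1.columns.intersection(df2.columns)]

Older pandas versions also accepted the & operator for the same purpose,

df2[df1.columns & df2.columns]

but that set-operation behavior was deprecated in pandas 1.x, and as of pandas 2.0 the & operator on an Index is an element-wise logical operation, so prefer Index.intersection.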
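As a minimal, reproducible sketch of the intersection approach, the snippet below rebuilds the two frames from the question with a seeded generator (the seed and the variable names common, mask, subset, and subset_mask are illustrative assumptions, not part of the original post). It also shows a boolean-mask variant using Index.isin, which is probably the closest pandas counterpart to R's names(df2) %in% names(df1).

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
df1 = pd.DataFrame({"a": rng.standard_normal(5), "b": rng.standard_normal(5)})
df2 = pd.DataFrame({
    "a": rng.standard_normal(5),
    "b": rng.standard_normal(5),
    "c": rng.standard_normal(5),
})

# Set-style intersection of the two column Indexes.
common = df1.columns.intersection(df2.columns)
subset = df2[common]

# Boolean-mask variant: True for each df2 column that also appears in df1.
mask = df2.columns.isin(df1.columns)
subset_mask = df2.loc[:, mask]

print(list(subset.columns))
```

Both selections keep only the columns shared with df1. The mask variant always preserves df2's own column order, which may matter when the two frames list their columns differently.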

However, if you are guaranteed that df1's columns are a subset of df2's columns, you can directly use

df2[df1.columns]

or if assigning,

df2.loc[:, df1.columns]
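To illustrate why the subset guarantee matters, here is a small sketch with made-up frames in which df1 has a column ("z") that df2 lacks. The data and the names raised, common, and safe are hypothetical. In recent pandas versions, direct selection with a missing label raises KeyError, while the intersection form avoids the error and can also be used as an assignment target with .loc.

```python
import pandas as pd

# Hypothetical frames: "z" exists in df1 but not in df2.
df1 = pd.DataFrame({"a": [1.0, 2.0], "z": [3.0, 4.0]})
df2 = pd.DataFrame({"a": [5.0, 6.0], "b": [7.0, 8.0], "c": [9.0, 10.0]})

# Direct selection fails because "z" is not a column of df2.
try:
    df2[df1.columns]
    raised = False
except KeyError:
    raised = True

# Restricting to the shared columns first avoids the KeyError.
common = df1.columns.intersection(df2.columns)
safe = df2[common]

# Assignment through .loc touches only the shared columns.
df2.loc[:, common] = 0.0
```

After this runs, only column "a" of df2 has been overwritten; "b" and "c" are left untouched.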

Demo

>>> df2[df1.columns.intersection(df2.columns)]
          a         b
0  1.952230 -0.641574
1  0.804606 -1.509773
2 -0.360106  0.939992
3  0.471858 -0.025248
4 -0.663493  2.031343

>>> df2.loc[:, df1.columns]
          a         b
0  1.952230 -0.641574
1  0.804606 -1.509773
2 -0.360106  0.939992
3  0.471858 -0.025248
4 -0.663493  2.031343
