Yang Yang
Yang Yang

Reputation: 902

How to subset a dataframe based on columns from another dataframe?

I have two data frames (df1 and df2) and I want to subset df2 based on the first two columns contained in df1. For example,

df1 = data.frame(x=c(1,1,1,1,1),y=c(1,2,3,4,5),value=c(3,4,5,6,7))
df2 = data.frame(x=c(1,1,1,1,1,2), y=c(5,3,4,2,1,6), value=c(8,9,10,11,12,13))

enter image description here

As we can see, row 6 (2,6) in df2 is not included in df1, so I will just select row 1 to row 5 in df2.

Also, I want to rearrange df2 based on df1. The final result should be like this:

enter image description here

Thanks for any help.

Upvotes: 1

Views: 2856

Answers (2)

IceCreamToucan
IceCreamToucan

Reputation: 28705

When using merge, by default the data frames are joined by the variables they have in common, and the results are sorted. So you can do:

merge(df2, df1[c('x', 'y')])

#   x y value
# 1 1 1    12
# 2 1 2    11
# 3 1 3     9
# 4 1 4    10
# 5 1 5     8

To sort by the order of df1, use @Mankind_008's method

merge(df1[c('x','y')], df2 , sort = F)

Example:

set.seed(0)
df1 <- df1[sample(seq_len(nrow(df1))),]
df2 <- df2[sample(seq_len(nrow(df2))),]
df1
#   x y value
# 5 1 5     7
# 2 1 2     4
# 4 1 4     6
# 3 1 3     5
# 1 1 1     3    
merge(df1[c('x','y')], df2 , sort = F)
#   x y value
# 1 1 5     8
# 2 1 2    11
# 3 1 4    10
# 4 1 3     9
# 5 1 1    12

Upvotes: 3

Carlos Eduardo Lagosta
Carlos Eduardo Lagosta

Reputation: 1001

Use data tables:

library(data.table)

Create your data as data.table:

df1 <- data.table( x = c(1,1,1,1,1), y = c(1,2,3,4,5), value = c(3,4,5,6,7) )
df2 <- data.table( x = c(1,1,1,1,1,2), y = c(5,3,4,2,1,6), value = c(8,9,10,11,12,13) )

Or convert your existing data.frames:

df1 <- as.data.table( df1 )
df2 <- as.data.table( df2 )

Then:

df2[ df1, on = .(x,y) ]

Any column in df1 that have the same name in df2 will be renamed as i.columnname:

   x y value i.value
1: 1 1    12       3
2: 1 2    11       4
3: 1 3     9       5
4: 1 4    10       6
5: 1 5     8       7

Note that it already order by x and y. If you want to order by the column 'value' (or any other):

df2[ df1, on = .(x,y) ][ order(value) ]

The advantage of using data.table (or dplyr, as the solution proposed by AntoniosK) is that you can keep the two data sets separated.

Upvotes: 1

Related Questions