Rollo99
Rollo99

Reputation: 1613

How to delete rows in a dataframe that correspond to missing rows in another dataframe?

I have two dataframes with two columns each (Date and data). The lenght of the columns differs. What I want to do is to delete the rows in df1 that are not in df2 by Date.

An example will clarify. These are my dataframes:

df1 = cbind(data.frame(Date = seq(as.Date("2018-11-1"), as.Date("2020-02-1"), by = "months"), stringsAsFactors = F), data.frame(Data = rnorm(16, 0, 1), stringsAsFactors = F))

         Date        Data
1  2018-11-01  1.09433662
2  2018-12-01 -0.27538189
3  2019-01-01 -0.19712728
4  2019-02-01  0.99852535
5  2019-03-01 -0.50760024
6  2019-04-01 -0.43127396
7  2019-05-01  0.90685965
8  2019-06-01  0.51510503
9  2019-07-01 -0.39070644
10 2019-08-01  1.27976428
11 2019-09-01 -0.63845519
12 2019-10-01 -0.05489751
13 2019-11-01 -0.87745923
14 2019-12-01  0.18082375
15 2020-01-01  0.08852416
16 2020-02-01  1.50827788

df2= cbind(data.frame(Date = df1$Date[c(1:5,7:9,11:13,15:16)]), data.frame(Data = c(1.09433662,-0.27538189, 0.99852535,-0.50760024,-0.43127396, 0.90685965,-0.39070644, 1.27976428,-0.63845519,-0.05489751,-0.87745923, 0.18082375, 1.50827788)))


         Date        Data
1  2018-11-01  1.09433662
2  2018-12-01 -0.27538189
3  2019-01-01  0.99852535
4  2019-02-01 -0.50760024
5  2019-03-01 -0.43127396
6  2019-05-01  0.90685965
7  2019-06-01 -0.39070644
8  2019-07-01  1.27976428
9  2019-09-01 -0.63845519
10 2019-10-01 -0.05489751
11 2019-11-01 -0.87745923
12 2020-01-01  0.18082375
13 2020-02-01  1.50827788

What I want now is that df1 is reduced to the same length as df2 by deleting the rows that are not in df2. The rows to be deleted correspond to the missing months in df2.

The result would be this for df1:

#df1 where the rows corresponding to the missing months in df2 have been deleted

     Date         Data

1  2018-11-01  1.09433662
2  2018-12-01 -0.27538189
3  2019-01-01 -0.19712728
4  2019-02-01  0.99852535
5  2019-03-01 -0.50760024
6  2019-05-01  0.90685965
7  2019-06-01  0.51510503
8  2019-07-01 -0.39070644
9 2019-09-01 -0.63845519
10 2019-10-01 -0.05489751
11 2019-11-01 -0.87745923
12 2020-01-01  0.08852416
13 2020-02-01  1.50827788

Can anyone help me?

Thanks a lot!

Upvotes: 0

Views: 84

Answers (1)

phiver
phiver

Reputation: 23608

semi_join from dplyr does what you are looking for. Note that your copied the data from df2 as the output example.

library(dplyr)

semi_join(df1, df2, by = "Date")

         Date        Data
1  2018-11-01  0.38376758
2  2018-12-01 -0.28738352
3  2019-01-01  1.79556305
4  2019-02-01 -0.34680836
5  2019-03-01  0.57803280
6  2019-05-01  1.96801082
7  2019-06-01  0.38448708
8  2019-07-01  0.39829417
9  2019-09-01  0.94912096
10 2019-10-01 -0.04469681
11 2019-11-01  0.32008546
12 2020-01-01  1.09054839
13 2020-02-01 -1.45438502

and anti_join shows the records that should be removed.

anti_join(df1, df2, by = "Date")

        Date       Data
1 2019-04-01  2.1303783
2 2019-08-01  1.6907800
3 2019-12-01 -0.8593388

Upvotes: 2

Related Questions