jimjames
jimjames

Reputation: 41

R merging two dataframes, but only select certain year rows from dateframe 2

I've got two dataframes, one with 2016-2020 and one with 2015-2020. I would like to extract the 2015 lines from dataframe2 and insert to dataframe1.

Dateframe1 has date, hits, keyword (same as dataframe2) The merged line must be matched by keywords. So 2015 "food" from dataframe2, must be inserted before 01.01.2016 "food" in dataframe1.

Ex: Dataframe1:  
*date hits keyword*  
2016-01-01 10 food  
2016-31-01 5 food  
2017-31-01 5 food  
2018-31-01 5 food  
2018-31-01 5 food   
2016-01-01 55 drink  
2016-22-01 1 drink    
2017-31-05 2 drink   
2018-31-01 1 drink    

So I want all lines in 2015 containing food to be inserted above 2016 food in dataframe1. And the same with drink. All drink 2015 from dataframe2 must be inserted before 2016 drink in dataframe1.

End result:

*date hits keyword*

**2015-31-01 5 food**  
2016-01-01 10 food  
2016-31-01 5 food  
2017-31-01 5 food  
2018-31-01 5 food  
2018-31-01 5 food   
**2015-31-01 7 food**  
2016-01-01 55 drink  
2016-22-01 1 drink    
2017-31-05 2 drink   
2018-31-01 1 drink  

Upvotes: 1

Views: 54

Answers (1)

r2evans
r2evans

Reputation: 160677

Three basic frame operations:

  1. Filter Dataframe2 to only include the rows we want.

    Dataframe2[grepl("^2015", Dataframe2$date),]
    #         date x    y
    # 2 2015-31-01 5 food
    # 3 2015-31-01 5 food
    # 4 2015-31-01 5 food
    
  2. Combine row-wise using rbind.

    Dataframe1 <- rbind(Dataframe2[grepl("^2015", Dataframe2$date),], Dataframe1)
    Dataframe1
    #          date  x     y
    # 2  2015-31-01  5  food
    # 3  2015-31-01  5  food
    # 4  2015-31-01  5  food
    # 1  2016-01-01 10  food
    # 21 2016-31-01  5  food
    # 31 2017-31-01  5  food
    # 41 2018-31-01  5  food
    # 5  2018-31-01  5  food
    # 6  2016-01-01 55 drink
    # 7  2016-22-01  1 drink
    # 8  2017-31-05  2 drink
    # 9  2018-31-01  1 drink
    
  3. Sort the resulting data.

    Dataframe1[order(Dataframe1$date),]
    #          date  x     y
    # 2  2015-31-01  5  food
    # 3  2015-31-01  5  food
    # 4  2015-31-01  5  food
    # 1  2016-01-01 10  food
    # 6  2016-01-01 55 drink
    # 7  2016-22-01  1 drink
    # 21 2016-31-01  5  food
    # 31 2017-31-01  5  food
    # 8  2017-31-05  2 drink
    # 41 2018-31-01  5  food
    # 5  2018-31-01  5  food
    # 9  2018-31-01  1 drink
    

I should note that these are all using string date values that will not sort correctly: they are sorting lexicographically, which is not numeric. Realize that

20 > 3
# [1] TRUE
"20" > "3"
# [1] FALSE

To do this right, the columns would be proper Date class columns:

# starting with a fresh `Dataframe1`
Dataframe1$date <- as.Date(Dataframe1$date, format = "%Y-%d-%m")
Dataframe2$date <- as.Date(Dataframe2$date, format = "%Y-%d-%m")

## Filter
lims <- as.Date(c("2015-01-01", "2015-31-12"), format = "%Y-%d-%m")
Dataframe2[ lims[1] <= Dataframe2$date & Dataframe2$date <= lims[2], ] # for demo

## Combine
Dataframe1 <- rbind(Dataframe2[ lims[1] <= Dataframe2$date & Dataframe2$date <= lims[2], ], Dataframe1)

## Order
Dataframe1[order(Dataframe1$date),]

Note that R will always show them in year-month-date order when in a Date class object. If you want it displayed something else, I suggest you do that only in report generation (using format(Dataframe1$date, format="..."), see ?strptime for format hints).

Upvotes: 1

Related Questions