Joshua Zecha
Joshua Zecha

Reputation: 141

Updating Data Frames

I have the following dataset, which originates from two datasets taken from an API at different points in time. df1 simply shows the state after I appended them. My goal is to generate the newest version of my API data, without forgetting the old data. This means I am looking to create some kind of update mechanism. I thought about creating a unique number for each dataset to identify its state, append the new version to the old one and then filter out the duplicates while keeping the newer data.

The data frames look like this:

df (after simply appending the two)

  "Year"    "Month"   "dataset"  
    2017    December      1
    2018    January       1 
    2018    January       2
    2018    February      1
    2018    February      2
    2018    March         2 
    2018    April         2 

df2 (the update)

  "Year"    "Month"   "dataset"  
    2017    December      1
    2018    January       2
    2018    February      2
    2018    March         2 
    2018    April         2 

As df2 shows, the update mechanism prefers the data from dataset 2. January and February data were in both data sets but only the data from February is kept.

On the other hand, if there is no overlap between the datasets, it keeps the old and the new data.

Is there a simple solution in order to create the described update mechanism in R?

This is the Code for df1:

df1 <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
             Month = 
c("December","January","January","February","February","March","April"),
             Dataset = c(1,1,2,1,2,2,2))

Upvotes: 0

Views: 56

Answers (1)

divibisan
divibisan

Reputation: 12155

Let me see if I have this right: you have 2 datasets (named 1 and 2) which you want to combine. Currently, you're getting the format shown above as df but you want the output to be df2. Is this correct? The below code should solve your problem. It is important that your newer dataset appears first in the full_join call. Whichever appears first will be given priority by distinct when it decides which duplicated rows to remove.

library(dplyr)
df <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
                  Month = c("December","January","January","February",
                            "February","March","April"),
                  Dataset = c(1,1,2,1,2,2,2))
df1 <- dfx[dfx$Dataset == 1,]
df2 <- dfx[dfx$Dataset == 2,]

df.updated <- dplyr::full_join(df2, df1) %>%
    distinct(Year, Month, .keep_all = TRUE)

df.updated

  Year    Month Dataset
1 2018  January       2
2 2018 February       2
3 2018    March       2
4 2018    April       2
5 2017 December       1

full_join joins the two data frames on matching variables, keeping all rows from both. Then distinct tosses out the duplicated rows. By specifying variable names in distinct, we tell it to only consider the values in Year and Month when determining uniqueness, so when a specific Year/Month combination appears in more than one dataset, only one row will be kept.

Normally, distinct only keeps the variables it uses to determine uniqueness. By providing the argument .keep_all = TRUE, it will keep all variables. When there are conflicts (for example, 2 rows from February 2018 with different values of Dataset) it will keep whichever row appears first in the data frame. This is why it's important for your newer dataset to appear first in the full_join: this gives rows that appear in df2 priority over rows that also appear in df1.

Upvotes: 1

Related Questions