Reputation: 101
I originally posted a simpler version of this problem, thinking the solution would generalize. It doesn't, so I am reposting the full problem here.
The original question (and solutions) can be found here: Matching data from one data frame to another
I have two data frames, dfa and dfb:
IDa <- c(1,2,3)
score1a <- c(5,10,1)
score2a <- c(NA,8,NA)
score3a <- c(NA,NA,13)
score1b <- c(NA,4,9)
score2b <- c(2,3,NA)
score2c <- c(1,5,1)
score3c <- c(6,NA,1)
dfa <- data.frame(IDa,score1a,score2a,score3a,score1b,score2b,score2c,score3c)
IDb <- c(1,1,1,2,2,3)
timeb <- c(1,2,3,2,3,3)
dfb <- data.frame(IDb,timeb)
In a column name such as score1a, the '1' corresponds to timeb = 1 in dfb and the 'a' is the test type (so there are three test types, a, b and c, and three timepoints, 1, 2 and 3).
I want to take the data from dfa and add it to dfb to create something like dfc below (note that the first two columns of dfc are identical to dfb):
IDc <- c(1,1,1,2,2,3)
timec <- c(1,2,3,2,3,3)
scorea <- c(5,NA,NA,8,NA,13)
scoreb <- c(NA,2,NA,3,NA,NA)
scorec <- c(NA,1,6,5,NA,1)
dfc <- data.frame(IDc, timec, scorea, scoreb, scorec)
Hope that makes sense, many thanks for any help with this!
Upvotes: 2
Views: 147
Reputation: 193527
Here's an alternative using merged.stack from my "splitstackshape" package along with merge.
Generally, reshape-related functions in R seem to like the names to be in the form of "type" + "time" (your variables are currently in the form of "time" + "type"). We can easily rename the columns to the desired form with setnames from "data.table" (which is loaded along with "splitstackshape").
library(splitstackshape)
setnames(dfa, gsub("(score)(\\d)([a-z])", "\\3_\\2", names(dfa)))
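To see what the renaming pattern does without touching dfa, the same gsub call can be run on a plain character vector. This is a minimal sketch in base R; the vector names old_names and new_names are illustrative and not part of the original answer.

```r
# Column names of dfa as defined in the question
old_names <- c("IDa", "score1a", "score2a", "score3a",
               "score1b", "score2b", "score2c", "score3c")

# Capture groups: \\1 = "score", \\2 = the time digit, \\3 = the test type.
# The replacement keeps only type and time, in that order, joined by "_".
new_names <- gsub("(score)(\\d)([a-z])", "\\3_\\2", old_names)

print(new_names)
# [1] "IDa" "a_1" "a_2" "a_3" "b_1" "b_2" "c_2" "c_3"
```

"IDa" is left unchanged because it does not match the pattern.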
Once the names are correct, we stack the relevant columns and merge the results with your second dataset. A conversion to numeric was required to make the merge take place on the same types of data.
setkey(
  merged.stack(dfa, var.stubs = c("^a", "^b", "^c"),
               sep = "_")[, .time_1 := as.numeric(.time_1)],
  IDa, .time_1)[setkeyv(as.data.table(dfb), names(dfb))]
#    IDa .time_1 ^a ^b ^c
# 1:   1       1  5 NA NA
# 2:   1       2 NA  2  1
# 3:   1       3 NA NA  6
# 4:   2       2  8  3  5
# 5:   2       3 NA NA NA
# 6:   3       3 13 NA  1
Upvotes: 2
Reputation: 101
Similar to @beginneR's answer, but avoiding the grouping and summarise_each step:
library(tidyr)
library(dplyr)
colnames(dfa)[-1] <- c("scorea1","scorea2","scorea3","scoreb1","scoreb2","scorec2","scorec3")
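dfa %>%
  gather(variable, value, scorea1:scorec3) %>%    # wide to long: one row per ID and score column
  separate(variable, c("score", "time"), 6) %>%   # split e.g. "scorea1" into "scorea" and "1"
  mutate(time = as.numeric(time)) %>%             # match the numeric type of timeb
  spread(score, value) %>%                        # back to wide: one column per test type
  left_join(dfb, ., by = c("IDb" = "IDa", "timeb" = "time"))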
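As a dependency-free cross-check of the expected result, the same table can be built in base R by looking up, for each row of dfb, the column of the wide data whose name encodes that row's time and test type. This is only a sketch under the question's naming scheme ("score" + time + type); the helper lookup_score and the name dfa_orig are hypothetical, introduced here so the renamed dfa above is not overwritten.

```r
# Data as defined in the question (original column names, e.g. score1a)
dfa_orig <- data.frame(
  IDa = c(1, 2, 3),
  score1a = c(5, 10, 1), score2a = c(NA, 8, NA), score3a = c(NA, NA, 13),
  score1b = c(NA, 4, 9), score2b = c(2, 3, NA),
  score2c = c(1, 5, 1),  score3c = c(6, NA, 1)
)
dfb <- data.frame(IDb = c(1, 1, 1, 2, 2, 3), timeb = c(1, 2, 3, 2, 3, 3))

# Hypothetical helper: for one test type, return the score for every
# (ID, time) row of `long`, or NA when no such column exists in `wide`.
lookup_score <- function(type, wide = dfa_orig, long = dfb) {
  cols <- paste0("score", long$timeb, type)  # e.g. "score2a"
  rows <- match(long$IDb, wide$IDa)          # row of `wide` for each ID
  vapply(seq_along(cols), function(i) {
    if (cols[i] %in% names(wide)) wide[[cols[i]]][rows[i]] else NA_real_
  }, numeric(1))
}

result <- data.frame(dfb,
                     scorea = lookup_score("a"),
                     scoreb = lookup_score("b"),
                     scorec = lookup_score("c"))
print(result)
```

The three score columns should agree with scorea, scoreb and scorec in the dfc given in the question.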
Upvotes: 0
Reputation: 70266
Here's an option using dplyr and tidyr:
require(dplyr)
require(tidyr)
gather(dfa, xx, timea, -IDa) %>%
  mutate(xx = as.character(xx),
         x = gsub("[0-9]", "", xx)) %>%
  spread(x, timea) %>%
  mutate(xx = as.numeric(gsub("[a-zA-Z]", "", xx))) %>%
  group_by(IDa, xx) %>%
  summarise_each(funs(first(.[!is.na(.)]))) %>%
  left_join(dfb, ., by = c("IDb" = "IDa", "timeb" = "xx"))
#  IDb timeb scorea scoreb scorec
#1   1     1      5     NA     NA
#2   1     2     NA      2      1
#3   1     3     NA     NA      6
#4   2     2      8      3      5
#5   2     3     NA     NA     NA
#6   3     3     13     NA      1
The following steps are made (per line of code):
.[!is.na(.)] removes all NA entries from the data, and the first() function wrapped around it then takes the first element of what remains. In general, summarise and summarise_each reduce the data to one row per group (which in this case holds the first non-NA entry).
Edit 2
Here are some examples to better understand what the first(.[!is.na(.)]) part does. Just remember that in the code, the . represents the grouped data that is passed to the function (equivalent to what I call x in the examples below).
set.seed(99)
x <- sample(10) #create a vector with random numbers
x
#[1] 6 2 10 7 4 5 3 1 8 9
x[sample(10, 4, replace = F)] <- NA # add some NAs
x
#[1] 6 NA 10 7 NA NA 3 1 NA 9
is.na(x) # is the value at each index of x equal to NA?
#[1] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
x[is.na(x)] # show me the values of x which are NA (of course, they are NA)
#[1] NA NA NA NA
x[!is.na(x)] # show me the values of x which are not NA (== remove NAs)
#[1] 6 10 7 3 1 9
dplyr::first(x[!is.na(x)]) # of all the values in x which are not NA, return the first one
#[1] 6
x[!is.na(x)][1] # this is equivalent to the previous line but using [1] instead of first()
#[1] 6
head(x[!is.na(x)], 1) # this is also equivalent to the two previous lines but using head(..., 1)
#[1] 6
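The examples above work on a single vector. To illustrate the "one value per group" idea as well, here is a small base R sketch using tapply instead of dplyr; the helper first_non_na and the toy vectors score and group are made up for illustration. It also shows what happens when a group contains only NAs, as for ID 2 at time 3 in the question.

```r
# Hypothetical helper: first non-NA element, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

score <- c(NA, 5, NA, NA, 8, NA)                 # toy values
group <- c("g1", "g1", "g2", "g2", "g3", "g3")   # toy grouping variable

# Apply the helper within each group: one value per group
res <- tapply(score, group, first_non_na)
print(res)
# g1 g2 g3
#  5 NA  8
```

Group g2 has no non-NA value, so indexing the empty result with [1] returns NA rather than raising an error.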
Hope that helps.
Upvotes: 2