Thomas
Thomas

Reputation: 2534

Matching data between multiple columns of 2 dataframes to return "matched" value or mean based on one or two matched columns

I have a complex problem and am not sure how to proceed. I have two dataframes called df1:

structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"), 
City = structure(1:2, .Label = c("bb", "e"), class = "factor"), 
Type1 = c(NA, NA), Type2 = c(NA, NA)), .Names = c("State", 
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA, 
-2L))

and df2:

structure(list(state = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("Aaa", "Dd"), class = "factor"), city = structure(c(1L, 
2L, 3L, 4L, 4L, 5L, 6L), .Label = c("bb", "ccc", "ddd", "fff", 
"ggg", "hh"), class = "factor"), type = structure(c(1L, 2L, 2L, 
2L, 2L, 2L, 3L), .Label = c("Type 1", "Type 2", "Type 4"), class = "factor"), 
value = 1:7), .Names = c("state", "city", "type", "value"
), class = "data.frame", row.names = c(NA, -7L))

Dataframe df1 looks like:

State City Type1 Type2
Aaa   bb    NA    NA
Dd    e    NA    NA

and dataframe df2 looks like:

state city   type value
Aaa   bb Type 1     1
Aaa  ccc Type 2     2
Aaa  ddd Type 2     3
Dd  fff Type 2     4
Dd  fff Type 2     5
Dd  ggg Type 2     6
Dd   hh Type 4     7

For the NA in df1, I need to look up the values from df2 according to the following rules:

1) If there is only a single instance where State = state and City = city for a given type in df2, insert value into the appropriate df1 column Type1 or Type2

2) When there are multiple instances where State = state and City = city for a given type, I need to average all value and insert this into df1

3) If there are no instances where State = state and City = city for a given type, I need to obtain the average of all state for this type and insert into df1

4) If there are no instances where State = state for a given type, then the value should remain NA in df1

Just to clarify -- essentially I am trying to average values Type1 and Type2 as "resolved" as possible. In other words, I want to use averages at the City level when possible, but when that is not possible, then I want to use State level averages. However, I want to return these averages for the original State and City outlined in df1 (even when State averages are all that is available

I know that is very complicated! The result I am trying to get is

structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"), 
City = structure(1:2, .Label = c("bb", "e"), class = "factor"), 
Type1 = c(1L, NA), Type2 = c(2.5, 5)), .Names = c("State", 
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA, 
-2L))

which is a dataframe like:

State City Type1 Type2
Aaa   bb     1   2.5
Dd    e    NA   5.0

I don't even know where to begin solving this. My first thought is that I need to use acast to reshape df2. For example, I can use

acast(df2, state+city+value~type)

which reshapes the data to more closely resemble df1, but then I loose some columns that I need to preserve (these get smooshed into the rowname). I don't even know how to begin the challenge of searching for City and State and then doing the averaging depending on those results.

Can anyone point me in the right direction?

EDIT (Jan 2015): I added a new comment below Troy's answer below, asking if there was an easy way to add a column identifying the level at which means are calculated (City or State). I've found a solution and although there's likely a better way, it works for me. Hope this helps someone!

getlevel<-function(state,city,type){
m<-means[means$state==state & means$city==city & means$type==type, "mean"]
sm<-state_means[state_means$state==state & state_means$type==type, "mean"]
ifelse(length(m)>0,"city","state")
}

then

ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type1Level=getlevel(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"),Type2Level=getlevel(as.character(State),as.character(City),"Type 2"))

Upvotes: 1

Views: 1302

Answers (2)

Ricardo Saporta
Ricardo Saporta

Reputation: 55420

First reshape the data in df2, then use data.table's key to merge the data appropriately:

library(data.table)
library(reshape2)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

First, fix the Type column of dt2 by reshaping

dt2.casted <- reshape2::dcast(dt2, state + city ~ type
                              , fill=NA_real_
                              , fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)

Next, set the keys so you can merge

setkey(dt2.casted, state, city)
setkey(dt1, State, City)

Finally, merge and take the mean, knocking out NAs

dt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE), by=State, .SDcols=grep("Type", names(dt2.casted), value=TRUE)]

   State Type 1 Type 2 Type 4
1:   Aaa      1   2.50    NaN
2:    Dd    NaN   5.25      7

Alternate, based on comments (no "city" aggregation)

dt2.casted <- reshape2::dcast(dt2, state ~ type
                              , fill=NA_real_
                              , fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)

setkey(dt2.casted, state)
setkey(dt1, State)

dt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE)
                , by=list(State, City)
                , .SDcols=grep("Type"
                , names(dt2.casted), value=TRUE)
                ]

   State City Type 1 Type 2 Type 4
1:   Aaa   bb      1    2.5    NaN
2:    Dd    e    NaN    5.0      7

Upvotes: 0

Troy
Troy

Reputation: 8691

EDIT - sorry misread the questions: here is the corrected code for your conditions:

require(plyr)
means<-ddply(df2,.(state,city,type),summarize,mean=mean(value))
state_means<-ddply(df2,.(state,type),summarize,mean=mean(value))
getval<-function(state,city,type){
  m<-means[means$state==state & means$city==city & means$type==type, "mean"]
  sm<-state_means[state_means$state==state & state_means$type==type, "mean"]
  ifelse(length(m)>0,m,sm)
}
## this gives you the new df1
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"))

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX PREVIOUS ANSWER (INCOMPLETE)

It's a bit difficult because your structure call doesn't work properly for df2, and your example datasets don't give you all the data in your expected result, but I think what you want is:

require(plyr)
means<-ddply(df2,.(state,city,type),summarize,mean=mean(value))
getval<-function(state,city,type){means[means$state==state & means$city==city & means$type==type, "mean"]}
## this gives you the new df1
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"))

############################################################################X

## what's happening in detail:
require(plyr)                      # calls the plyr library
means<-ddply(df2,                  # base on df2 
             .(state,city,type),   # summarize by combination of city/state/type
             summarize,            # tells plyr to summarize rather than transform
             mean=mean(value))     # show one column at each summary level, called 'mean', the average val

getval<-function(state,city,type){     # create function called getval, takes 3 parameters
  means[means$state==state &           # first part of [X,]
          means$city==city &           # selects the row that matches all criteria
          means$type==type,            
        "mean"]}                       # and [,X] the column relating to the type

getval("Aaa","bb","Type 2")
# this gives you the new df2
ddply(df1,                             # base on df1
      .(State,City),                   # summarize by State & City
      transform,                       # tell plyr to transform existingn set rather than roll up
      Type1=getval(as.character(State),as.character(City),"Type 1"),   # call getval() for Type 1
      Type2=getval(as.character(State),as.character(City),"Type 2"))   # and for Type 2

which gives you the following (not your expected result, but what is implied by the data)

  State City Type1 Type2
1   Aaa   bb     1    NA
2    Dd    e    NA    NA

Upvotes: 1

Related Questions