Reputation: 2534
I have a complex problem and am not sure how to proceed. I have two dataframes called df1
:
structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"),
City = structure(1:2, .Label = c("bb", "e"), class = "factor"),
Type1 = c(NA, NA), Type2 = c(NA, NA)), .Names = c("State",
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA,
-2L))
and df2
:
structure(list(state = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("Aaa", "Dd"), class = "factor"), city = structure(c(1L,
2L, 3L, 4L, 4L, 5L, 6L), .Label = c("bb", "ccc", "ddd", "fff",
"ggg", "hh"), class = "factor"), type = structure(c(1L, 2L, 2L,
2L, 2L, 2L, 3L), .Label = c("Type 1", "Type 2", "Type 4"), class = "factor"),
value = 1:7), .Names = c("state", "city", "type", "value"
), class = "data.frame", row.names = c(NA, -7L))
Dataframe df1
looks like:
State City Type1 Type2
Aaa bb NA NA
Dd e NA NA
and dataframe df2
looks like:
state city type value
Aaa bb Type 1 1
Aaa ccc Type 2 2
Aaa ddd Type 2 3
Dd fff Type 2 4
Dd fff Type 2 5
Dd ggg Type 2 6
Dd hh Type 4 7
For the NA
in df1
, I need to look up the values from df2
according to the following rules:
1) If there is only a single instance where State
= state
and City
= city
for a given type
in df2
, insert value
into the appropriate df1
column Type1
or Type2
2) When there are multiple instances where State
= state
and City
= city
for a given type
, I need to average all value
and insert this into df1
3) If there are no instances where State
= state
and City
= city
for a given type
, I need to obtain the average of all state
for this type
and insert into df1
4) If there are no instances where State
= state
for a given type
, then the value should remain NA
in df1
Just to clarify -- essentially I am trying to average values Type1
and Type2
as "resolved" as possible. In other words, I want to use averages at the City
level when possible, but when that is not possible, then I want to use State
level averages. However, I want to return these averages for the original State
and City
outlined in df1
(even when State
averages are all that is available
I know that is very complicated! The result I am trying to get is
structure(list(State = structure(1:2, .Label = c("Aaa", "Dd"), class = "factor"),
City = structure(1:2, .Label = c("bb", "e"), class = "factor"),
Type1 = c(1L, NA), Type2 = c(2.5, 5)), .Names = c("State",
"City", "Type1", "Type2"), class = "data.frame", row.names = c(NA,
-2L))
which is a dataframe like:
State City Type1 Type2
Aaa bb 1 2.5
Dd e NA 5.0
I don't even know where to begin solving this. My first thought is that I need to use acast
to reshape df2
. For example, I can use
acast(df2, state+city+value~type)
which reshapes the data to more closely resemble df1
, but then I loose some columns that I need to preserve (these get smooshed into the rowname). I don't even know how to begin the challenge of searching for City
and State
and then doing the averaging depending on those results.
Can anyone point me in the right direction?
EDIT (Jan 2015): I added a new comment below Troy's answer below, asking if there was an easy way to add a column identifying the level at which means are calculated (City or State). I've found a solution and although there's likely a better way, it works for me. Hope this helps someone!
getlevel<-function(state,city,type){
m<-means[means$state==state & means$city==city & means$type==type, "mean"]
sm<-state_means[state_means$state==state & state_means$type==type, "mean"]
ifelse(length(m)>0,"city","state")
}
then
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type1Level=getlevel(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"),Type2Level=getlevel(as.character(State),as.character(City),"Type 2"))
Upvotes: 1
Views: 1302
Reputation: 55420
First reshape the data in df2, then use data.table
's key to merge the data appropriately:
library(data.table)
library(reshape2)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
Type
column of dt2 by reshapingdt2.casted <- reshape2::dcast(dt2, state + city ~ type
, fill=NA_real_
, fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)
keys
so you can mergesetkey(dt2.casted, state, city)
setkey(dt1, State, City)
NA
sdt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE), by=State, .SDcols=grep("Type", names(dt2.casted), value=TRUE)]
State Type 1 Type 2 Type 4
1: Aaa 1 2.50 NaN
2: Dd NaN 5.25 7
dt2.casted <- reshape2::dcast(dt2, state ~ type
, fill=NA_real_
, fun.aggregate=mean, na.rm=TRUE)
dt2.casted <- as.data.table(dt2.casted)
setkey(dt2.casted, state)
setkey(dt1, State)
dt1[dt2.casted][, lapply(.SD, mean, na.rm=TRUE)
, by=list(State, City)
, .SDcols=grep("Type"
, names(dt2.casted), value=TRUE)
]
State City Type 1 Type 2 Type 4
1: Aaa bb 1 2.5 NaN
2: Dd e NaN 5.0 7
Upvotes: 0
Reputation: 8691
EDIT - sorry misread the questions: here is the corrected code for your conditions:
require(plyr)
means<-ddply(df2,.(state,city,type),summarize,mean=mean(value))
state_means<-ddply(df2,.(state,type),summarize,mean=mean(value))
getval<-function(state,city,type){
m<-means[means$state==state & means$city==city & means$type==type, "mean"]
sm<-state_means[state_means$state==state & state_means$type==type, "mean"]
ifelse(length(m)>0,m,sm)
}
## this gives you the new df1
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"))
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX PREVIOUS ANSWER (INCOMPLETE)
It's a bit difficult because your structure call doesn't work properly for df2, and your example datasets don't give you all the data in your expected result, but I think what you want is:
require(plyr)
means<-ddply(df2,.(state,city,type),summarize,mean=mean(value))
getval<-function(state,city,type){means[means$state==state & means$city==city & means$type==type, "mean"]}
## this gives you the new df1
ddply(df1,.(State,City),transform,Type1=getval(as.character(State),as.character(City),"Type 1"),Type2=getval(as.character(State),as.character(City),"Type 2"))
############################################################################X
## what's happening in detail:
require(plyr) # calls the plyr library
means<-ddply(df2, # base on df2
.(state,city,type), # summarize by combination of city/state/type
summarize, # tells plyr to summarize rather than transform
mean=mean(value)) # show one column at each summary level, called 'mean', the average val
getval<-function(state,city,type){ # create function called getval, takes 3 parameters
means[means$state==state & # first part of [X,]
means$city==city & # selects the row that matches all criteria
means$type==type,
"mean"]} # and [,X] the column relating to the type
getval("Aaa","bb","Type 2")
# this gives you the new df2
ddply(df1, # base on df1
.(State,City), # summarize by State & City
transform, # tell plyr to transform existingn set rather than roll up
Type1=getval(as.character(State),as.character(City),"Type 1"), # call getval() for Type 1
Type2=getval(as.character(State),as.character(City),"Type 2")) # and for Type 2
which gives you the following (not your expected result, but what is implied by the data)
State City Type1 Type2
1 Aaa bb 1 NA
2 Dd e NA NA
Upvotes: 1