Reputation: 3760
I have a dataframe df, and its first row row1:
df <- data.frame(x = c(1,1,1,1,0,0,1), y = c(0,0,0,0,1,1,0), z = c(1,0,0,0,0,1,1))
> df
x y z
1 1 0 1
2 1 0 0
3 1 0 0
4 1 0 0
5 0 1 0
6 0 1 1
7 1 0 1
> row1 <- df[1,]
> row1
x y z
1 1 0 1
For row 1, I want to extract the mean of column z, within the rows that have the same value of other columns (x and y) as in row1. I want to store this value in a new column called prob_z.
In this example, since there are 5 rows with x and y as 1 and 0 respectively, of which 2 rows have z = 1, and 3 rows have z = 0, I would affix a column prob_z to row 1 with the value 2/(2+3) = 0.4. I would similarly compute the values of prob_x, prob_y and prob_z for each row and corresponding column, and store them as columns in df. So, after all these computations by each row and each column, I want to end up with:
x y z prob_x prob_y prob_z
1 1 0 1 1.0 0.0 0.4
2 1 0 0 1.0 0.0 0.4
3 1 0 0 1.0 0.0 0.4
4 1 0 0 1.0 0.0 0.4
5 0 1 0 0.0 1.0 0.5
6 0 1 1 0.0 1.0 0.5
7 1 0 1 1.0 0.0 0.4
I also need the columns to be referenced by name as in -"z", since in my real example, I have a large number of columns.
I have tried different approaches using base R and dplyr, but am having a hard time getting the result as expected.
Upvotes: 1
Views: 3741
Reputation: 3760
This solution is composed of Nathan Day's approach, Mark Peterson's use of mutate_, and this SO question - this should generalize well.
df <- data.frame(x = c(1,1,1,1,0,0,1), y = c(0,0,0,0,1,1,0), z = c(1,0,0,0,0,1,1))
for(i in 1:3) {
dots <- paste("mean(", names(df[i]), ")")
df <- df %>%
group_by_(.dots = lapply(names(df)[-i], as.symbol)) %>%
mutate_(.dots = setNames(dots, paste("prob_", names(df[i]) )))
}
df
x y z prob_x prob_y prob_z
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 1 0 0.4
2 1 0 0 1 0 0.4
3 1 0 0 1 0 0.4
4 1 0 0 1 0 0.4
5 0 1 0 0 1 0.5
6 0 1 1 0 1 0.5
7 1 0 1 1 0 0.4
Upvotes: 0
Reputation: 9560
Alright, I think I have it. This was a fun set of challenges, though a fair bit more challenging than I expected it to be. I believe it is robust to any number of columns, and any types of names, but I could certainly be wrong.
First, I am defining a function that checks all columns except the column of interest to see if they match. Note that it explicitly calls df
and row1
instead of trying to pass those values in as variables. Probably not ideal, but should work for this case:
myFunction <-
function(thisCol){
apply(select_(df, paste0("-`", thisCol,"`"))
, 1
, function(thisRow) {
all(thisRow == select_(row1, paste0("-`", thisCol,"`")))
})
}
Then, I used interp
from lazyeval
to generate a column telling if there is a match or not. Recall that the "match" actually checks to see if all the other columns match row one, not whether or not this one does.
library(lazyeval)
forMatchID <-
lapply(names(df), function(thisColName){
interp(~myFunction(colName), colName = thisColName)
}) %>%
setNames(paste("Match", names(df)))
Which returns:
$`Match x`
~myFunction("x")
<environment: 0x110feb20>
$`Match y`
~myFunction("y")
<environment: 0x11103da8>
$`Match z`
~myFunction("z")
<environment: 0x111080c8>
Then, I used that to generate a data.frame that said whether or not there was an acceptable match for that column (again, matching all others from row 1 or not):
dfWithMatchCols <-
df %>%
mutate_(.dots = forMatchID)
Which returns:
x y z Match x Match y Match z
1 1 0 1 TRUE TRUE TRUE
2 1 0 0 FALSE FALSE TRUE
3 1 0 0 FALSE FALSE TRUE
4 1 0 0 FALSE FALSE TRUE
5 0 1 0 FALSE FALSE FALSE
6 0 1 1 FALSE FALSE FALSE
7 1 0 1 TRUE TRUE TRUE
Then, I generate a new set of columns to generate, here generating the proportion of rows that match row one (on other columns) -- or those that don't -- that have a value of 1
for that column:
forProb <-
paste0("ifelse(`Match ", names(df), "`"
, ", mean(`", names(df), "`[`Match ", names(df), "`])"
, ", mean(`", names(df), "`[!`Match ", names(df), "`]) )") %>%
setNames(paste0("prob_", names(df)))
Which returns
prob_x
"ifelse(`Match x`, mean(`x`[`Match x`]), mean(`x`[!`Match x`]) )"
prob_y
"ifelse(`Match y`, mean(`y`[`Match y`]), mean(`y`[!`Match y`]) )"
prob_z
"ifelse(`Match z`, mean(`z`[`Match z`]), mean(`z`[!`Match z`]) )"
Finally, I pass this into mutate_
and remove the "Match
" columns (note, could accidentally remove columns if they start with that, but that would likely also cause collisions above):
dfWithProb <-
dfWithMatchCols %>%
mutate_(.dots = forProb) %>%
select(-starts_with("Match"))
Returns:
x y z prob_x prob_y prob_z
1 1 0 1 1.0 0.0 0.4
2 1 0 0 0.6 0.4 0.4
3 1 0 0 0.6 0.4 0.4
4 1 0 0 0.6 0.4 0.4
5 0 1 0 0.6 0.4 0.5
6 0 1 1 0.6 0.4 0.5
7 1 0 1 1.0 0.0 0.4
In practice, you would probably not generate the intermediate step, instead running it like this:
df %>%
mutate_(.dots = forMatchID) %>%
mutate_(.dots = forProb) %>%
select(-starts_with("Match"))
Now, if I misunderstood your "matching" argument, a simple change to the myFunction
definition should then propagate to all other steps in the process.
Based on a comment, it appears that you actually want just a single output, to predict probability only for those that match on all other elements. I think you may be better served with using summarise_
, like so:
forSingProb <-
paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
setNames(paste0("prob_", names(df)))
df %>%
mutate_(.dots = forMatchID) %>%
summarise_(.dots = forSingProb)
Which returns:
prob_x prob_y prob_z
1 1 0 0.4
Alright, following several updates in comments, I think this should work. I am using the summarise_
approach from above, looping through each row separately in lapply
, and removing the row that is being investigated from the analysis (it's inclusion of the value of interest should not play a role in the outcome), then binding everything together with bind_rows
and joining back to the original data:
myFunction_updated <-
function(thisCol, rowIndex){
apply(select_(df[-rowIndex, ], paste0("-`", thisCol,"`"))
, 1
, function(thisRow) {
all(thisRow == select_(df[rowIndex, ], paste0("-`", thisCol,"`")))
})
}
forSingProb <-
paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
setNames(paste0("prob_", names(df)))
lapply(1:nrow(df), function(thisRowIndex){
forMatchID <-
lapply(names(df), function(thisColName){
interp(~myFunction_updated(colName, rowIndex)
, colName = thisColName
, rowIndex = thisRowIndex)
}) %>%
setNames(paste("Match", names(df)))
df[-thisRowIndex, ] %>%
mutate_(.dots = forMatchID) %>%
summarise_(.dots = forSingProb)
}) %>%
bind_rows(.id = "rowIndex") %>%
left_join(mutate(df, rowIndex = as.character(1:n()))
, .)
returns:
x y z rowIndex prob_x prob_y prob_z
1 1 0 1 1 1 0 0.25
2 1 0 0 2 1 0 0.50
3 1 0 0 3 1 0 0.50
4 1 0 0 4 1 0 0.50
5 0 1 0 5 NaN NaN 1.00
6 0 1 1 6 NaN NaN 0.00
7 1 0 1 7 1 0 0.25
Note that the NaN
values are correct, as there are no matching rows.
If you insist on including the observed value in your prediction (which I want to stress is almost certainly a bad idea), you can adjust it like so:
myFunction_updated <-
function(thisCol, rowIndex){
apply(select_(df, paste0("-`", thisCol,"`"))
, 1
, function(thisRow) {
all(thisRow == select_(df[rowIndex, ], paste0("-`", thisCol,"`")))
})
}
forSingProb <-
paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
setNames(paste0("prob_", names(df)))
lapply(1:nrow(df), function(thisRowIndex){
forMatchID <-
lapply(names(df), function(thisColName){
interp(~myFunction_updated(colName, rowIndex)
, colName = thisColName
, rowIndex = thisRowIndex)
}) %>%
setNames(paste("Match", names(df)))
df %>%
mutate_(.dots = forMatchID) %>%
summarise_(.dots = forSingProb)
}) %>%
bind_rows(.id = "rowIndex") %>%
left_join(mutate(df, rowIndex = as.character(1:n()))
, .)
Which gives:
x y z rowIndex prob_x prob_y prob_z
1 1 0 1 1 1 0 0.4
2 1 0 0 2 1 0 0.4
3 1 0 0 3 1 0 0.4
4 1 0 0 4 1 0 0.4
5 0 1 0 5 0 1 0.5
6 0 1 1 6 0 1 0.5
7 1 0 1 7 1 0 0.4
Why can't I leave this alone? Anyway, here is an extension to a logical next step. My guess is that this will be used to predict missing variables from a given set. Here, for example, as might be generated in your actual application. Specifically, it samples 1 or 2 of the variables, and randomly sets each to 0 or 1.
productionData <-
lapply(1:10, function(idx){
nToSample <- sample(1:2, 1)
sample(c(0,1), nToSample, replace = TRUE) %>%
setNames(sample(c("x","y","z"), nToSample))
})
Then, we can loop through each, filter the known data to those that match on all points, and then calculate a probability for all of the remaining variables. The bind_rows
and select
at the end are just for pretty printing. Depending on usage, this step may or may not be more useful. In particular, if any variables are either always or never present, this will fail (because there are no prob_*
values or no entries for that variable column)
lapply(productionData, function(thisRowIn){
filtering <-
lapply(names(thisRowIn), function(thisCol){
paste0("`", thisCol, "` == ", thisRowIn[thisCol])
})
whichMissing <-
names(df)[!(names(df) %in% names(thisRowIn))]
df %>%
filter_(.dots = filtering) %>%
summarise_at(whichMissing, mean) %>%
setNames(paste0("prob_", names(.))) %>%
mutate_(.dots = as.list(thisRowIn))
}) %>%
bind_rows() %>%
select_(.dots = c(names(df), paste0("prob_", names(df))))
returns:
x y z prob_x prob_y prob_z
1 NA 1 NA 0.00 NA 0.5
2 NA 0 NA 1.00 NA 0.4
3 0 1 NA NA NA 0.5
4 NA 1 1 0.00 NA NA
5 1 NA 1 NA 0.00 NA
6 1 0 NA NA NA 0.4
7 NA NA 0 0.75 0.25 NA
8 1 0 NA NA NA 0.4
9 NA 0 NA 1.00 NA 0.4
10 1 NA 1 NA 0.00 NA
With NA
where values are missing and where predictions are not needed.
Upvotes: 2
Reputation: 10671
Would it be acceptable if we did this?
df %>% group_by(x) %>% mutate(prob_x = sum(x) / n(),
prob_y = sum(y) / n(),
prob_z = sum(z) / n())
Source: local data frame [7 x 6]
Groups: x [2]
x y z prob_x prob_y prob_z
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 1 0 0.4
2 1 0 0 1 0 0.4
3 1 0 0 1 0 0.4
4 1 0 0 1 0 0.4
5 0 1 0 0 1 0.5
6 0 1 1 0 1 0.5
7 1 0 1 1 0 0.4
Upvotes: 1