Anand
Anand

Reputation: 3760

Get rows in dataframe that have same values in columns as given row

I have a dataframe df, and its first row row1:

df <- data.frame(x = c(1,1,1,1,0,0,1), y = c(0,0,0,0,1,1,0), z = c(1,0,0,0,0,1,1))
> df
  x y z
1 1 0 1
2 1 0 0
3 1 0 0
4 1 0 0
5 0 1 0
6 0 1 1
7 1 0 1
> row1 <- df[1,]
> row1
  x y z
1 1 0 1

For row 1, I want to extract the mean of column z, within the rows that have the same value of other columns (x and y) as in row1. I want to store this value in a new column called prob_z.

In this example, since there are 5 rows with x and y as 1 and 0 respectively, of which 2 rows have z = 1, and 3 rows have z = 0, I would affix a column prob_z to row 1 with the value 2/(2+3) = 0.4. I would similarly compute the values of prob_x, prob_y and prob_z for each row and corresponding column, and store them as columns in df. So, after all these computations by each row and each column, I want to end up with:

  x y z prob_x prob_y  prob_z
1 1 0 1  1.0     0.0     0.4
2 1 0 0  1.0     0.0     0.4  
3 1 0 0  1.0     0.0     0.4
4 1 0 0  1.0     0.0     0.4 
5 0 1 0  0.0     1.0     0.5
6 0 1 1  0.0     1.0     0.5
7 1 0 1  1.0     0.0     0.4

I also need the columns to be referenced by name as in -"z", since in my real example, I have a large number of columns.

I have tried different approaches using base R and dplyr, but am having a hard time getting the result as expected.

Upvotes: 1

Views: 3741

Answers (3)

Anand
Anand

Reputation: 3760

This solution is composed of Nathan Day's approach, Mark Peterson's use of mutate_, and this SO question - this should generalize well.

df <- data.frame(x = c(1,1,1,1,0,0,1), y = c(0,0,0,0,1,1,0), z = c(1,0,0,0,0,1,1))
for(i in 1:3) {
    dots <- paste("mean(", names(df[i]), ")")
    df <- df %>% 
        group_by_(.dots = lapply(names(df)[-i], as.symbol)) %>% 
        mutate_(.dots = setNames(dots, paste("prob_", names(df[i]) )))
}
df
      x     y     z prob_x prob_y prob_z
  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     1     0     1      1      0    0.4
2     1     0     0      1      0    0.4
3     1     0     0      1      0    0.4
4     1     0     0      1      0    0.4
5     0     1     0      0      1    0.5
6     0     1     1      0      1    0.5
7     1     0     1      1      0    0.4

Upvotes: 0

Mark Peterson
Mark Peterson

Reputation: 9560

Alright, I think I have it. This was a fun set of challenges, though a fair bit more challenging than I expected it to be. I believe it is robust to any number of columns, and any types of names, but I could certainly be wrong.

First, I am defining a function that checks all columns except the column of interest to see if they match. Note that it explicitly calls df and row1 instead of trying to pass those values in as variables. Probably not ideal, but should work for this case:

myFunction <-
  function(thisCol){
    apply(select_(df, paste0("-`", thisCol,"`"))
          , 1
          , function(thisRow) {
            all(thisRow == select_(row1, paste0("-`", thisCol,"`")))
            })
  }

Then, I used interp from lazyeval to generate a column telling if there is a match or not. Recall that the "match" actually checks to see if all the other columns match row one, not whether or not this one does.

library(lazyeval)

forMatchID <-
  lapply(names(df), function(thisColName){
    interp(~myFunction(colName), colName = thisColName)
  }) %>%
  setNames(paste("Match", names(df)))

Which returns:

$`Match x`
~myFunction("x")
<environment: 0x110feb20>

$`Match y`
~myFunction("y")
<environment: 0x11103da8>

$`Match z`
~myFunction("z")
<environment: 0x111080c8>

Then, I used that to generate a data.frame that said whether or not there was an acceptable match for that column (again, matching all others from row 1 or not):

dfWithMatchCols <-
  df %>%
  mutate_(.dots = forMatchID)

Which returns:

  x y z Match x Match y Match z
1 1 0 1    TRUE    TRUE    TRUE
2 1 0 0   FALSE   FALSE    TRUE
3 1 0 0   FALSE   FALSE    TRUE
4 1 0 0   FALSE   FALSE    TRUE
5 0 1 0   FALSE   FALSE   FALSE
6 0 1 1   FALSE   FALSE   FALSE
7 1 0 1    TRUE    TRUE    TRUE

Then, I generate a new set of columns to generate, here generating the proportion of rows that match row one (on other columns) -- or those that don't -- that have a value of 1 for that column:

forProb <-
  paste0("ifelse(`Match ", names(df), "`"
         , ", mean(`", names(df), "`[`Match ", names(df), "`])"
         , ", mean(`", names(df), "`[!`Match ", names(df), "`]) )") %>%
  setNames(paste0("prob_", names(df)))

Which returns

                                                           prob_x 
"ifelse(`Match x`, mean(`x`[`Match x`]), mean(`x`[!`Match x`]) )" 
                                                           prob_y 
"ifelse(`Match y`, mean(`y`[`Match y`]), mean(`y`[!`Match y`]) )" 
                                                           prob_z 
"ifelse(`Match z`, mean(`z`[`Match z`]), mean(`z`[!`Match z`]) )" 

Finally, I pass this into mutate_ and remove the "Match" columns (note, could accidentally remove columns if they start with that, but that would likely also cause collisions above):

dfWithProb <-
  dfWithMatchCols %>%
  mutate_(.dots = forProb) %>%
  select(-starts_with("Match"))

Returns:

  x y z prob_x prob_y prob_z
1 1 0 1    1.0    0.0    0.4
2 1 0 0    0.6    0.4    0.4
3 1 0 0    0.6    0.4    0.4
4 1 0 0    0.6    0.4    0.4
5 0 1 0    0.6    0.4    0.5
6 0 1 1    0.6    0.4    0.5
7 1 0 1    1.0    0.0    0.4

In practice, you would probably not generate the intermediate step, instead running it like this:

df %>%
  mutate_(.dots = forMatchID) %>%
  mutate_(.dots = forProb) %>%
  select(-starts_with("Match"))

Now, if I misunderstood your "matching" argument, a simple change to the myFunction definition should then propagate to all other steps in the process.

Summarise for a single value

Based on a comment, it appears that you actually want just a single output, to predict probability only for those that match on all other elements. I think you may be better served with using summarise_, like so:

forSingProb <-
  paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
  setNames(paste0("prob_", names(df)))

df %>%
  mutate_(.dots = forMatchID) %>%
  summarise_(.dots = forSingProb)

Which returns:

  prob_x prob_y prob_z
1      1      0    0.4

Getting values for all rows

Alright, following several updates in comments, I think this should work. I am using the summarise_ approach from above, looping through each row separately in lapply, and removing the row that is being investigated from the analysis (it's inclusion of the value of interest should not play a role in the outcome), then binding everything together with bind_rows and joining back to the original data:

myFunction_updated <-
  function(thisCol, rowIndex){
    apply(select_(df[-rowIndex, ], paste0("-`", thisCol,"`"))
          , 1
          , function(thisRow) {
            all(thisRow == select_(df[rowIndex, ], paste0("-`", thisCol,"`")))
          })
  }

forSingProb <-
  paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
  setNames(paste0("prob_", names(df)))

lapply(1:nrow(df), function(thisRowIndex){
  forMatchID <-
    lapply(names(df), function(thisColName){
      interp(~myFunction_updated(colName, rowIndex)
             , colName = thisColName
             , rowIndex = thisRowIndex)
    }) %>%
    setNames(paste("Match", names(df)))

  df[-thisRowIndex, ] %>%
    mutate_(.dots = forMatchID) %>%
    summarise_(.dots = forSingProb)
}) %>%
  bind_rows(.id = "rowIndex") %>%
  left_join(mutate(df, rowIndex = as.character(1:n()))
            , .)

returns:

  x y z rowIndex prob_x prob_y prob_z
1 1 0 1        1      1      0   0.25
2 1 0 0        2      1      0   0.50
3 1 0 0        3      1      0   0.50
4 1 0 0        4      1      0   0.50
5 0 1 0        5    NaN    NaN   1.00
6 0 1 1        6    NaN    NaN   0.00
7 1 0 1        7      1      0   0.25

Note that the NaN values are correct, as there are no matching rows.

If you insist on including the observed value in your prediction (which I want to stress is almost certainly a bad idea), you can adjust it like so:

myFunction_updated <-
  function(thisCol, rowIndex){
    apply(select_(df, paste0("-`", thisCol,"`"))
          , 1
          , function(thisRow) {
            all(thisRow == select_(df[rowIndex, ], paste0("-`", thisCol,"`")))
          })
  }

forSingProb <-
  paste0("mean(`", names(df), "`[`Match ", names(df), "`])") %>%
  setNames(paste0("prob_", names(df)))


lapply(1:nrow(df), function(thisRowIndex){
  forMatchID <-
    lapply(names(df), function(thisColName){
      interp(~myFunction_updated(colName, rowIndex)
             , colName = thisColName
             , rowIndex = thisRowIndex)
    }) %>%
    setNames(paste("Match", names(df)))

  df %>%
    mutate_(.dots = forMatchID) %>%
    summarise_(.dots = forSingProb)
}) %>%
  bind_rows(.id = "rowIndex") %>%
  left_join(mutate(df, rowIndex = as.character(1:n()))
            , .)

Which gives:

  x y z rowIndex prob_x prob_y prob_z
1 1 0 1        1      1      0    0.4
2 1 0 0        2      1      0    0.4
3 1 0 0        3      1      0    0.4
4 1 0 0        4      1      0    0.4
5 0 1 0        5      0      1    0.5
6 0 1 1        6      0      1    0.5
7 1 0 1        7      1      0    0.4

For use in production

Why can't I leave this alone? Anyway, here is an extension to a logical next step. My guess is that this will be used to predict missing variables from a given set. Here, for example, as might be generated in your actual application. Specifically, it samples 1 or 2 of the variables, and randomly sets each to 0 or 1.

productionData <-
  lapply(1:10, function(idx){
    nToSample <- sample(1:2, 1)

    sample(c(0,1), nToSample, replace = TRUE) %>%
      setNames(sample(c("x","y","z"), nToSample))
  })

Then, we can loop through each, filter the known data to those that match on all points, and then calculate a probability for all of the remaining variables. The bind_rows and select at the end are just for pretty printing. Depending on usage, this step may or may not be more useful. In particular, if any variables are either always or never present, this will fail (because there are no prob_* values or no entries for that variable column)

lapply(productionData, function(thisRowIn){
  filtering <-
    lapply(names(thisRowIn), function(thisCol){
      paste0("`", thisCol, "` == ", thisRowIn[thisCol])
    })

  whichMissing <-
    names(df)[!(names(df) %in% names(thisRowIn))]

  df %>%
    filter_(.dots = filtering) %>%
    summarise_at(whichMissing, mean) %>%
    setNames(paste0("prob_", names(.))) %>%
    mutate_(.dots = as.list(thisRowIn)) 
}) %>%
  bind_rows() %>%
  select_(.dots = c(names(df), paste0("prob_", names(df))))

returns:

    x  y  z prob_x prob_y prob_z
1  NA  1 NA   0.00     NA    0.5
2  NA  0 NA   1.00     NA    0.4
3   0  1 NA     NA     NA    0.5
4  NA  1  1   0.00     NA     NA
5   1 NA  1     NA   0.00     NA
6   1  0 NA     NA     NA    0.4
7  NA NA  0   0.75   0.25     NA
8   1  0 NA     NA     NA    0.4
9  NA  0 NA   1.00     NA    0.4
10  1 NA  1     NA   0.00     NA

With NA where values are missing and where predictions are not needed.

Upvotes: 2

Nate
Nate

Reputation: 10671

Would it be acceptable if we did this?

df %>% group_by(x) %>% mutate(prob_x = sum(x) / n(),
                                prob_y = sum(y) / n(),
                                prob_z = sum(z) / n())
Source: local data frame [7 x 6]
Groups: x [2]

      x     y     z prob_x prob_y prob_z
  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     1     0     1      1      0    0.4
2     1     0     0      1      0    0.4
3     1     0     0      1      0    0.4
4     1     0     0      1      0    0.4
5     0     1     0      0      1    0.5
6     0     1     1      0      1    0.5
7     1     0     1      1      0    0.4

Upvotes: 1

Related Questions