SSMFTKIB
SSMFTKIB

Reputation: 3

Removing groups from dataframe if variable has repeated values

I would like to ask if there is a way of removing a group from dataframe using dplyr (or anz other way in that matter) in the following way. Lets say I have a dataframe in the following form grouped by variable 1:

Variable 1   Variable 2
1            a
1            b     
2            a
2            a
2            b
3            a
3            c
3            a
...          ...

I would like to remove only groups that have in Variable 2 two consecutive same values. That is in table above it would remove group 2 because there are values a,a,b but not group c where is a,c,a. So I would get the table bellow?

Variable 1   Variable 2
1            a
1            b     
3            a
3            c
3            a
...          ...

Upvotes: 0

Views: 602

Answers (3)

Gwang-Jin Kim
Gwang-Jin Kim

Reputation: 10035

prepare data frame:

df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))

write functions to test if consecutive repetitions are there or not:

any.consecutive.p <- function(v) {
  for (i in 1:(length(v) - 1)) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}

any.consecutive.in.col.p <- function(df, col) {
  any.consecutive.p(df[, col])
}

any.consecutive.p returns TRUE if it finds first consecutive repetition in a vector (v). any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.

split data frame by values of Variable.1

df.l <- split(df, df$Variable.1)

df.l
$`1`
  Variable.1 Variable.2
1          1          a
2          1          b

$`2`
  Variable.1 Variable.2
3          2          a
4          2          a
5          2          b

$`3`
  Variable.1 Variable.2
6          3          a
7          3          c
8          3          a

Finally go over this data.frame list and test for each data frame, if it contains consecutive duplicates in Variable.2 column. If found, don't collect it. Bind the collected data frames by rows.

Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))

  Variable.1 Variable.2
1          1          a
2          1          b
6          3          a
7          3          c
8          3          a

Upvotes: 1

IceCreamToucan
IceCreamToucan

Reputation: 28705

Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.

set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))


# dplyr
library(dplyr)

df %>% 
  group_by(a) %>% 
  filter(all(b != lag(b), na.rm = T))


#data.table
library(data.table)
setDT(df)

df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]

Benchmark shows data.table is faster

#Results
# Unit: milliseconds
#         expr       min        lq     mean    median        uq      max neval
#  use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643   100
#     use_DT()  36.27936  50.23011  64.9218  53.87114  66.73943 345.2863   100

# Method
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))

use_dplyr <- function(x){
df %>% 
  group_by(a) %>% 
  filter(all(b != lag(b), na.rm = T))
}


use_DT <- function(x){

  df[, if (all(b != shift(b), na.rm = T)) .SD, a]

}

microbenchmark(use_dplyr(), use_DT())

Upvotes: 0

camille
camille

Reputation: 16881

To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)

Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.

library(tidyverse)

df %>%
    group_by(variable1) %>%
    mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
    filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups:   variable1 [2]
#>   variable1 variable2 dupesInGroup
#>       <int> <chr>            <int>
#> 1         1 a                    0
#> 2         1 b                    0
#> 3         3 a                    0
#> 4         3 c                    0
#> 5         3 a                    0

Created on 2018-05-10 by the reprex package (v0.2.0).

Upvotes: 1

Related Questions