Couch Tomato
Couch Tomato

Reputation: 85

How to specify and clean/delete the rows?

For example, here is the dataframe. I want to delete the rows if the age difference in each ProjectNumber is less than 5. Within ProjectNumber = 1, the age difference are 6(56-50) so rows in ProjectNumber 1 is maintained. In ProjectNumber2, the age difference are 2(37-35), so 3 rows in ProjectNumber 2 will be deleted. How can I do that? There are thousands of Project Number in real data set so can't really do it mannually. The key problem is how to specify these rows and delete them. Thank you.

ProjectNumber Age
1 50
1 52
1 53
1 55
1 56
2 35
2 36
2 37
3 40
3 41
3 42
3 43
3 45
3 46

Upvotes: 4

Views: 65

Answers (4)

ThomasIsCoding
ThomasIsCoding

Reputation: 102349

A data.table option

> setDT(df)[, .SD[diff(range(Age)) >= 5], ProjectNumber]
    ProjectNumber Age
 1:             1  50
 2:             1  52
 3:             1  53
 4:             1  55
 5:             1  56
 6:             3  40
 7:             3  41
 8:             3  42
 9:             3  43
10:             3  45
11:             3  46

Upvotes: 1

TarJae
TarJae

Reputation: 79174

Akrun's answer is top. Here: cumbersome but leads to the goal:

df %>% 
  group_by(ProjectNumber) %>% 
  mutate(diff_Age = max(Age) - min(Age)) %>% 
  subset(!diff_Age < 5) %>% 
  select(-diff_Age)

Output:

   ProjectNumber   Age
           <int> <int>
 1             1    50
 2             1    52
 3             1    53
 4             1    55
 5             1    56
 6             3    40
 7             3    41
 8             3    42
 9             3    43
10             3    45
11             3    46

Upvotes: 3

r2evans
r2evans

Reputation: 160647

Base R variants:

  1. aggregate then merge. This version is good if you have more than one "key" (ProjectNumber here).

    keep <- subset(aggregate(Age ~ ProjectNumber, data = df1,
                             FUN = function(z) diff(range((z))) >= 5), Age)
    keep
    #   ProjectNumber  Age
    # 1             1 TRUE
    # 3             3 TRUE
    
    merge(df1, keep[,1,drop=FALSE], by = "ProjectNumber")
    #    ProjectNumber Age
    # 1              1  50
    # 2              1  52
    # 3              1  53
    # 4              1  55
    # 5              1  56
    # 9              3  40
    # 10             3  41
    # 11             3  42
    # 12             3  43
    # 13             3  45
    # 14             3  46
    
  2. aggregate then %in%:

    ### 'keep' from above
    subset(df1, ProjectNumber %in% keep$ProjectNumber)
    
  3. ave then simple subset:

    keep <- ave(df1$Age, df1$ProjectNumber, 
                FUN = function(z) diff(range(z)) >= 5) > 0
    keep
    #  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    df1[keep,]
    #    ProjectNumber Age
    # 1              1  50
    # 2              1  52
    # 3              1  53
    # 4              1  55
    # 5              1  56
    # 9              3  40
    # 10             3  41
    # 11             3  42
    # 12             3  43
    # 13             3  45
    # 14             3  46
    

Upvotes: 5

akrun
akrun

Reputation: 887571

We do a group by filter i.e. grouped by 'ProjectNumber', check the difference of range of 'Age' is greater than or equal to 5

library(dplyr)
df1 %>% 
  group_by(ProjectNumber) %>% 
  filter(diff(range(Age)) >=5) %>%
  ungroup

data

df1 <- structure(list(ProjectNumber = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L), Age = c(50L, 52L, 53L, 55L, 56L, 
35L, 36L, 37L, 40L, 41L, 42L, 43L, 45L, 46L)), class = "data.frame",
row.names = c(NA, 
-14L))

Upvotes: 2

Related Questions