Joe Huntley
Joe Huntley

Reputation: 13

Removing rows of subsetted data that occur only once

(Inexperienced coder -- sorry if easy question!)

I have two data frames, x and y:

  • x: contains one column of unique numeric IDs
  • y: contains one column of non-unique numeric IDs and one column of a variable numeric code (diagnostic medical codes)

    All IDs in y are in x (and there are no extra IDs in either data frame), but the IDs in y can repeat for several rows. Each ID in y has a code in the adjacent column. These codes can vary within and between IDs, and there can be multiple of the same code per ID (i.e., there can be multiple of the same ID-code pairings).

    What I want to do: for each ID in x (or a unique list of IDs in y), select the rows in y that have a matching ID and remove them from y if the code in the second column occurs one time only for that specific ID. (Or, for each ID in x, choose the matching rows in y that have codes which occur more than once, and make a new data frame with just those ID-code pairings.)

    It is also possible (likely?) that I won't need to use x at all, because the IDs in y and x are the exact same, but are just repeated in y.


    I tried using a for loop to make a new df, but it seems to take forever and ever...

    df <- data.frame()
    for (a in x$id){
          r <- subset(y, id==a)
          s <- count(r, code) %>% subset(n > 1)
          t <- subset(r, code %in% s)
          df <- rbind(df, t)
         }
    

    (I'm sure this is terribly inefficient, please don't make fun of me. Also wrote this from memory so hopefully it runs.)

    Example data frames:

    y

    id      code
    12345   1092
    12345   1092
    12345   7448
    12345   2274
    74839   6117
    74839   8337
    90029   1092
    90029   1092
    90029   1092
    90029   5521
    90029   5521
    90029   9338
    

    x

    id
    12345
    74839
    90029
    

    The output would ideally be:

    id      code
    12345   1092
    12345   1092
    90029   1092
    90029   1092
    90029   1092
    90029   5521
    90029   5521
    


    Is there a better way to do this? Thank you for any help.

    Upvotes: 1

    Views: 1792

  • Answers (1)

    jazzurro
    jazzurro

    Reputation: 23574

    One way would be the following. First, first you subset observations in y using ids in x. Then, you group your data with id and code and remove any groups, which have only one observation.

    library(dplyr)
    
    filter(y, id %in% x$id) %>%
    group_by(id, code) %>%
    filter(n() != 1) %>%
    ungroup
    

    Another way would be the following.

    filter(y, id %in% x$id) %>%
    group_by(id) %>%
    filter(!(!duplicated(code) & !duplicated(code, fromLast = TRUE))) 
    
    
    #     id  code
    #  <int> <int>
    #1 12345  1092
    #2 12345  1092
    #3 90029  1092
    #4 90029  1092
    #5 90029  1092
    #6 90029  5521
    #7 90029  5521
    

    Upvotes: 2

    Related Questions