Removing rows of subsetted data that occur only once

Question

(Inexperienced coder -- sorry if easy question!)

I have two data frames, x and y:

x: contains one column of unique numeric IDs

y: contains one column of non-unique numeric IDs and one column of a variable numeric code (diagnostic medical codes)

All IDs in y are in x (and there are no extra IDs in either data frame), but the IDs in y can repeat for several rows. Each ID in y has a code in the adjacent column. These codes can vary within and between IDs, and there can be multiple of the same code per ID (i.e., there can be multiple of the same ID-code pairings).

What I want to do: for each ID in x (or a unique list of IDs in y), select the rows in y that have a matching ID and remove them from y if the code in the second column occurs one time only for that specific ID. (Or, for each ID in x, choose the matching rows in y that have codes which occur more than once, and make a new data frame with just those ID-code pairings.)

It is also possible (likely?) that I won't need to use x at all, because the IDs in y and x are the exact same, but are just repeated in y.

I tried using a for loop to make a new df, but it seems to take forever and ever...

df <- data.frame()
for (a in x$id){
      r <- subset(y, id==a)
      s <- count(r, code) %>% subset(n > 1)
      t <- subset(r, code %in% s)
      df <- rbind(df, t)
     }

(I'm sure this is terribly inefficient, please don't make fun of me. Also wrote this from memory so hopefully it runs.)

Example data frames:

y

id      code
12345   1092
12345   1092
12345   7448
12345   2274
74839   6117
74839   8337
90029   1092
90029   1092
90029   1092
90029   5521
90029   5521
90029   9338

x

The output would ideally be:

id      code
12345   1092
12345   1092
90029   1092
90029   1092
90029   1092
90029   5521
90029   5521

Is there a better way to do this? Thank you for any help.

jazzurro · Accepted Answer

One way would be the following. First, first you subset observations in y using ids in x. Then, you group your data with id and code and remove any groups, which have only one observation.

library(dplyr)

filter(y, id %in% x$id) %>%
group_by(id, code) %>%
filter(n() != 1) %>%
ungroup

Another way would be the following.

filter(y, id %in% x$id) %>%
group_by(id) %>%
filter(!(!duplicated(code) & !duplicated(code, fromLast = TRUE))) 


#     id  code
#   
#1 12345  1092
#2 12345  1092
#3 90029  1092
#4 90029  1092
#5 90029  1092
#6 90029  5521
#7 90029  5521

Removing rows of subsetted data that occur only once

Answers (1)

Related Questions