Reputation: 13
(Inexperienced coder -- sorry if easy question!)
I have two data frames, x
and y
:
x
: contains one column of unique numeric IDs
y
: contains one column of non-unique numeric IDs and one column of a variable numeric code (diagnostic medical codes)
All IDs in y
are in x
(and there are no extra IDs in either data frame), but the IDs in y
can repeat for several rows. Each ID in y
has a code in the adjacent column. These codes can vary within and between IDs, and there can be multiple of the same code per ID (i.e., there can be multiple of the same ID-code pairings).
What I want to do: for each ID in x
(or a unique list of IDs in y
), select the rows in y
that have a matching ID and remove them from y
if the code in the second column occurs one time only for that specific ID. (Or, for each ID in x
, choose the matching rows in y
that have codes which occur more than once, and make a new data frame with just those ID-code pairings.)
It is also possible (likely?) that I won't need to use x
at all, because the IDs in y
and x
are the exact same, but are just repeated in y
.
I tried using a for loop to make a new df, but it seems to take forever and ever...
df <- data.frame()
for (a in x$id){
r <- subset(y, id==a)
s <- count(r, code) %>% subset(n > 1)
t <- subset(r, code %in% s)
df <- rbind(df, t)
}
(I'm sure this is terribly inefficient, please don't make fun of me. Also wrote this from memory so hopefully it runs.)
Example data frames:
y
id code
12345 1092
12345 1092
12345 7448
12345 2274
74839 6117
74839 8337
90029 1092
90029 1092
90029 1092
90029 5521
90029 5521
90029 9338
x
id
12345
74839
90029
The output would ideally be:
id code
12345 1092
12345 1092
90029 1092
90029 1092
90029 1092
90029 5521
90029 5521
Is there a better way to do this? Thank you for any help.
Upvotes: 1
Views: 1792
Reputation: 23574
One way would be the following. First, first you subset observations in y using ids in x. Then, you group your data with id and code and remove any groups, which have only one observation.
library(dplyr)
filter(y, id %in% x$id) %>%
group_by(id, code) %>%
filter(n() != 1) %>%
ungroup
Another way would be the following.
filter(y, id %in% x$id) %>%
group_by(id) %>%
filter(!(!duplicated(code) & !duplicated(code, fromLast = TRUE)))
# id code
# <int> <int>
#1 12345 1092
#2 12345 1092
#3 90029 1092
#4 90029 1092
#5 90029 1092
#6 90029 5521
#7 90029 5521
Upvotes: 2