Reputation: 607
I am trying to clean up a data set of student results for processing, and I'm having trouble removing duplicates completely. I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data for one of the duplicated students is:
id period desc
632 1507 1101 90714 Research a contemporary biological issue
633 1507 1101 6317 Explain the process of speciation
634 1507 1101 8931 Describe gene expression
14448 1507 1201 8931 Describe gene expression
14449 1507 1201 6317 Explain the process of speciation
14450 1507 1201 90714 Research a contemporary biological issue
25884 1507 1301 6317 Explain the process of speciation
25885 1507 1301 8931 Describe gene expression
25886 1507 1301 90714 Research a contemporary biological issue
The first 2 digits of reg_period are the year they sat the paper. As can be seen, I want to keep the rows where id is 1507 and reg_period is 1101. So far, my code to get the values I want to trim is:
unique.rows <- unique(df[c("id", "period")])
dups <- (unique.rows[duplicated(unique.rows$id),])
However, I am running into a couple of problems. This only works because the data is ordered by id and reg_period, and this isn't guaranteed in future. Plus, I don't know how to take this list of duplicate entries and select the rows that are not in it, because %in% doesn't seem to work with it and a loop with rbind runs out of memory.
What's the best way to handle this?
Upvotes: 1
Views: 181
Reputation: 145755
I would probably use dplyr. Calling your data df:
library(dplyr)
result = df %>% group_by(id) %>%
  filter(period == min(period))
If you prefer base, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:
id_pd = df[order(df$id, df$period), c("id", "period")]
id_pd = id_pd[!duplicated(id_pd$id), ]
result = merge(df, id_pd)
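As a rough alternative that skips both the sort and the join, base R's ave() can attach each id's minimum period to every row, so the filter does not depend on row order. This is only a sketch: the sample rows are rebuilt from the question in shuffled order, and the second student (id 2001) is made up for illustration.

```r
# Sample rows rebuilt from the question, deliberately NOT sorted by period.
# The last row (id 2001) is a hypothetical student with a single attempt.
df <- data.frame(
  id     = c(rep(1507, 9), 2001),
  period = c(rep(c(1301, 1101, 1201), each = 3), 1201),
  desc   = c(rep(c("90714 Research a contemporary biological issue",
                   "6317 Explain the process of speciation",
                   "8931 Describe gene expression"), times = 3),
             "8931 Describe gene expression"),
  stringsAsFactors = FALSE
)

# For every row, ave() returns the minimum period within that row's id group
first_period <- ave(df$period, df$id, FUN = min)

# Keep only the rows from each student's earliest period
result <- df[df$period == first_period, ]
print(result)
```

Because the comparison is done per row against the group minimum, all rows from the first attempt are kept, not just one row per student.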
Upvotes: 2
Reputation: 3154
Try this, it works for me with your data:
dd <- read.csv("a.csv", colClasses=c("numeric","numeric","character"), header=TRUE)
print (dd)
dd <- dd[order(dd$id, dd$period), ]
dd <- dd[!duplicated(dd[, c("id","period")]), ]
print (dd)
Output:
id period desc
1 1507 1101 90714 Research a contemporary biological issue
4 1507 1201 8931 Describe gene expression
7 1507 1301 6317 Explain the process of speciation
Upvotes: 0