Reputation: 413
I have a data frame which contains an event history and I want to check its integrity by checking whether the last event for each ID number matches the current value in the system for that ID number. The data are coded as factors. The following toy data frame is a minimal example:
df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
current.grade=as.factor(c("Senior","Senior","Senior","Senior",
"Junior","Junior","Junior",
"Sophomore","Sophomore")),
grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
"Freshman","Sophomore","Junior",
"Freshman","Sophomore")))
which gives output
> df
ID current.grade grade.history
1 1 Senior Freshman
2 1 Senior Sophomore
3 1 Senior Junior
4 1 Senior Senior
5 2 Junior Freshman
6 2 Junior Sophomore
7 2 Junior Junior
8 3 Sophomore Freshman
9 3 Sophomore Sophomore
> str(df)
'data.frame': 9 obs. of 3 variables:
$ ID : num 1 1 1 1 2 2 2 3 3
$ current.grade: Factor w/ 3 levels "Junior","Senior",..: 2 2 2 2 1 1 1 3 3
$ grade.history: Factor w/ 4 levels "Freshman","Junior",..: 1 4 2 3 1 4 2 1 4
I want to use dplyr
to extract the last value in grade.history
and check it against current.grade
:
df.summary <- df %>%
group_by(ID) %>%
summarize(current.grade.last=last(current.grade),
grade.history.last=last(grade.history))
However, dplyr
seems to convert the factors to integers, so I get this:
> df.summary
Source: local data frame [3 x 3]
ID current.grade.last grade.history.last
1 1 2 3
2 2 1 2
3 3 3 4
> str(df.summary)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 3 variables:
$ ID : num 1 2 3
$ current.grade.last: int 2 1 3
$ grade.history.last: int 3 2 4
Note that the values don't line up because the original factors had different level sets. What's the right way to do this with dplyr
?
I'm using R
version 3.1.1 and dplyr
version 0.3.0.2
Upvotes: 10
Views: 12503
Reputation: 93771
Another way to approach this is to put your factor levels in their natural order, in this case Freshman, Sophomore, Junior, Senior, and then select the highest value for each ID using the which.max
function for indexing. If you do it this way, you won't have to worry about whether your columns are ordered from lowest to highest grade for each ID (as you do if you use the last
function).
library(dplyr)
df <-data.frame(ID=c(1,1,1,1,2,2,2,3,3),
current.grade=as.factor(c("Senior","Senior","Senior","Senior",
"Junior","Junior","Junior",
"Sophomore","Sophomore")),
grade.history=as.factor(c("Freshman","Sophomore","Junior","Senior",
"Freshman","Sophomore","Junior",
"Freshman","Sophomore")))
# Ordered vector of grades
gradeLookup = c("Freshman", "Sophomore", "Junior", "Senior")
# Reset the values in the grade columns to the ordering in gradeLookup
df[,-1] = lapply(df[,-1], function(x) {
factor(x, levels=gradeLookup)
})
# For each ID, select the values of current.grade and grade.history at the maximum
# value of grade.history
df %>% group_by(ID) %>%
summarise(current.grade.last = current.grade[which.max(grade.history)],
grade.history.last = grade.history[which.max(grade.history)])
ID current.grade.last grade.history.last
1 1 Senior Senior
2 2 Junior Junior
3 3 Sophomore Sophomore
UPDATE 2: Since you want to sort and capture the last value (rather than the maximum value) by column, rather than whole rows, try this:
df %>% group_by(ID) %>%
summarise(current.grade.last = current.grade[length(grade.history)],
grade.history.last = grade.history[length(grade.history)])
END UPDATE 2
Does your data include a time variable, like year, term, or academic year? If so, you can dispense with current.grade
and direclty select the value of grade.history
at the most recent year of attendance. This will give you each student's last grade level. For example (assuming your time variable is called year
):
df %>% group_by(ID) %>%
summarise(last.grade = grade.history[which.max(year)])
UPDATE 1: I'm not sure what's causing your code to return the numerical code for each level, rather than the level label. It's not just an issue with the last
function (you can see this if you do last(df$grade.history)
). However, if you want to sort by time-stamp and then return the last row, the code below will keep the level labels. slice
returns the rows you specify within each value of ID
. In this case we specify the last row by using n()
, which returns the total number of rows for each value of ID
.
df.summary <- df %>%
group_by(ID) %>%
slice(n())
Upvotes: 3
Reputation: 54237
I guess it lies in the nature of a factor
object in R, which is set of integer codes with a "levels" attribute of mode character. One way to overcome your problem: Wrap the factor variables into as.character
:
df.summary <- df %>%
group_by(ID) %>%
summarize(current.grade.last=last(as.character(current.grade)),
grade.history.last=last(as.character(grade.history)))
Upvotes: 0