Reputation: 3
I have a data frame that look like this (called df1)
trip_id | station_id | station_name |
---|---|---|
id123 | s01 | A Street |
id385 | s02 | B Street |
id332 | s01 | C Street |
id423 | s01 | A Street |
The problem is there is an inconsistency with the station name column (multiple names for one id) and I want to correct it based on the most popular name used with the same station id. For example, in the table above, all rows with station id = "s01" must have station name = "A Street" (since A Street occurred 2 times and C Street occurred only once). The result should look like this:
trip_id | station_id | station_name |
---|---|---|
id123 | s01 | A Street |
id385 | s02 | B Street |
id332 | s01 | A Street |
id423 | s01 | A Street |
All I'm able to do so far is to extract a list of station id with more than 1 name:
dupl_list <- unique(df1[,c("station_id","station_name")]) %>% group_by (station_id) %>% count() %>% filter(n>1)
Thx for reading
Upvotes: 0
Views: 61
Reputation: 4725
Using base R you can do this:
# data
df1 <- data.frame(trip_id=c('id123', 'id332', 'id385', 'id423'),
station_id=c('s01', 's02', 's01', 's01'),
station_name=c('A Street', 'B Street', 'C Street', 'A Street'),
date_of_trip=c(1, 1, 3, 2))
# most common name for each id (alphabetically lowest in case of ties)
id.name <- tapply(df1$station_name, df1$station_id, function(x) {
tab <- table(x)
names(which.max(tab))
})
df1$station_name <- id.name[df1$station_id]
dplyr
way:
df1 <- df1 %>%
group_by(station_id) %>%
mutate(station_name = names(which.max(table(station_name))))
According to the most recent trip:
df1 <- df1 %>%
group_by(station_id) %>%
mutate(station_name = station_name[which.max(date_of_trip)])
Upvotes: 1