RStudio - Multiple name for one id problem

Question

I have a data frame that look like this (called df1)

trip_id	station_id	station_name
id123	s01	A Street
id385	s02	B Street
id332	s01	C Street
id423	s01	A Street

The problem is there is an inconsistency with the station name column (multiple names for one id) and I want to correct it based on the most popular name used with the same station id. For example, in the table above, all rows with station id = "s01" must have station name = "A Street" (since A Street occurred 2 times and C Street occurred only once). The result should look like this:

trip_id	station_id	station_name
id123	s01	A Street
id385	s02	B Street
id332	s01	A Street
id423	s01	A Street

All I'm able to do so far is to extract a list of station id with more than 1 name:

dupl_list <- unique(df1[,c("station_id","station_name")]) %>% group_by (station_id) %>% count() %>% filter(n>1) Thx for reading

Robert Hacken · Accepted Answer

Using base R you can do this:

# data
df1 <- data.frame(trip_id=c('id123', 'id332', 'id385', 'id423'),
                  station_id=c('s01', 's02', 's01', 's01'),
                  station_name=c('A Street', 'B Street', 'C Street', 'A Street'),
                  date_of_trip=c(1, 1, 3, 2))
# most common name for each id (alphabetically lowest in case of ties)
id.name <- tapply(df1$station_name, df1$station_id, function(x) {
  tab <- table(x)
  names(which.max(tab))
})

df1$station_name <- id.name[df1$station_id]

dplyr way:

df1 <- df1 %>% 
  group_by(station_id) %>% 
  mutate(station_name = names(which.max(table(station_name))))

According to the most recent trip:

df1 <- df1 %>% 
  group_by(station_id) %>% 
  mutate(station_name = station_name[which.max(date_of_trip)])

RStudio - Multiple name for one id problem

Answers (1)

Related Questions