Reputation: 107
I have a very messy dataframe that looks like
df <- data.frame(Job = c("casual", "part time", "full time", "Level A total" , "casual","full time","Level B total"), institute1 = c(1,2,2,5,0,1,1))
Where the rows above "Level B total" refer to level B, until going up the rows you get to "level A total" where it now refers to level A. The data is >500 lines long so manually cleaning it is an option but an unpleasant one, but I cant think of how to code it so I can add information so R knows what Level each cell is referring to.
Upvotes: 0
Views: 89
Reputation: 5798
Base R solution:
transform(within(df[rev(seq_len(nrow(df))),],
{
Level <- ifelse(grepl("Level\\s*[A-Z]", Job),
gsub("\\s*total", "", Job), NA_character_)
}
), Level = na.omit(Level)[cumsum(!(is.na(Level)))])[rev(seq_len(nrow(df))),]
Upvotes: 0
Reputation: 389325
We can create a new column Level
and put all the "Level"
values in it. fill
the NA
values with the non-NA value below it. Clean up the Level
column by adding text from Job
.
library(dplyr)
df %>%
mutate(Level = replace(Job, !grepl('Level', Job), NA)) %>%
tidyr::fill(Level, .direction = 'up') %>%
mutate(Level = ifelse(grepl('total', Job),
Job, paste0(sub('total', '', Level), Job)))
# Job institute1 Level
#1 casual 1 Level A casual
#2 part time 2 Level A part time
#3 full time 2 Level A full time
#4 Level A total 5 Level A total
#5 casual 0 Level B casual
#6 full time 1 Level B full time
#7 Level B total 1 Level B total
Upvotes: 2