I've got a data set with a large number of inconsistently named variables. The current names are captured in df1, and the desired names in df2. What I'd like is for a variable name such as initial_salary_D1_H1 to be changed to initialsalary_D1_H1, in other words for any characters that are not letters or digits (here, underscores and dots) before the _D1_H1 suffix to be removed.
The _D1_H1 pattern, i.e. an underscore followed by a letter and a number (_+anyletter+anynumber), appears at the end of most variable names, although the number of times it appears varies. It may appear zero times (Date.of.Birth), twice (Employee_Date_Start_D1_H1), or three times (Employee_Gender_g2_D1_H1).
There's the added difficulty that the pattern also appears towards the beginning of some variable names (lunch_p5_type.canteen_D1_H1).
Finally, in some variable names the last character before the pattern is a letter (all.new.Staff_D2_H4), and in others it's a digit (dept_Preferences2_D1_H8).
df1 <- data.frame(Date.of.Birth=c(1,1,1),initial_salary_D1_H1=c(1,1,1),Employee_Gender_g2_D1_H1=c(1,1,1),Employee_Date_Start_D1_H1=c(1,1,1),dept_Preferences2_D1_H8=c(1,1,1),ini_InterviewShortlist2_D1_H6 =c(1,1,1),Retentionpercentage_D7_H19=c(1,1,1),all.new.Staff_D2_H4=c(1,1,1),all_old_D3_H13=c(1,1,1),lunch_p5_type.canteen_D1_H1=c(1,1,1),all_EmploymentStatus4_D2_H11=c(1,1,1))
df2 <- data.frame(DateofBirth=c(1,1,1),initialsalary_D1_H1=c(1,1,1),EmployeeGender_g2_D1_H1=c(1,1,1),EmployeeDateStart_D1_H1=c(1,1,1),deptPreferences2_D1_H8=c(1,1,1),iniInterviewShortlist2_D1_H6 =c(1,1,1),Retentionpercentage_D7_H19=c(1,1,1),allnewStaff_D2_H4=c(1,1,1),allold_D3_H13=c(1,1,1),lunchp5typecanteen_D1_H1=c(1,1,1),allEmploymentStatus4_D2_H11=c(1,1,1))
names(df1)
names(df2)
Here's what I've been working on, but I'm completely new to pattern matching. Is using substring() along with gsub() an option?
substring(names(df1), regexpr("*_\w\d$i*", names(df1)) - 1)
Thanks for any help!
UPDATE:
How do I stick it in a function and save the output as data frames (I've got three data frames)?
Reputation: 5951
There is probably a shorter version of this, but I think this does the job:
paste0(
gsub("[_.]", "", gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", names(df1))),
gsub(".*[a-zA-Z]{2,}[0-9]*", "", names(df1)))
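One possible shorter variant, offered only as a sketch and not as the approach explained below, uses a single Perl-style negative lookahead. It assumes a trailing suffix is one or more groups of an underscore, one letter, and one or more digits; `old_names` and `suffix_ahead` are illustrative names, not part of the original code.

```r
# A few of the column names from df1 in the question.
old_names <- c("Date.of.Birth", "initial_salary_D1_H1",
               "Employee_Gender_g2_D1_H1", "lunch_p5_type.canteen_D1_H1",
               "dept_Preferences2_D1_H8")

# Assumption: the suffix to keep is a run of "_<letter><digits>" groups
# that reaches the end of the name. The lookahead fails (so the "_" is kept)
# only when such a run starts right after the underscore.
suffix_ahead <- "(?![A-Za-z][0-9]+(?:_[A-Za-z][0-9]+)*$)"

# Remove every "." and every "_" that does not start the trailing suffix.
new_names <- gsub(paste0("\\.|_", suffix_ahead), "", old_names, perl = TRUE)
new_names
```

The lookahead needs `perl = TRUE`, since R's default regex engine does not support it.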
Explanation
This part, gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", names(df1)), keeps everything up to and including the last run of at least two letters followed by zero or more digits. Let's call this X.
This part, gsub("[_.]", "", X), removes the _ and . characters from X.
The last part, gsub(".*[a-zA-Z]{2,}[0-9]*", "", names(df1)), replaces X with nothing, which leaves only the trailing suffix.
Finally, paste0() joins the cleaned X and the suffix back together.
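The steps described above can be traced on a couple of names from df1. This is just an illustration of the intermediate values; `nm`, `X_clean` and `suffix` are names chosen for the example.

```r
nm <- c("initial_salary_D1_H1", "dept_Preferences2_D1_H8")

# Step 1: keep everything up to the last run of 2+ letters (plus any digits).
X <- gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", nm)
X        # "initial_salary"  "dept_Preferences2"

# Step 2: strip underscores and dots from that leading part.
X_clean <- gsub("[_.]", "", X)
X_clean  # "initialsalary"   "deptPreferences2"

# Step 3: delete the leading part from the original, leaving the suffix.
suffix <- gsub(".*[a-zA-Z]{2,}[0-9]*", "", nm)
suffix   # "_D1_H1"  "_D1_H8"

# Step 4: glue the cleaned leading part and the suffix back together.
paste0(X_clean, suffix)
```

Note that this relies on the suffix groups containing only a single letter each; a suffix such as _AB1 would be treated as part of the leading name.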
To use it in a function and save the output as a data frame:
normalize_names <- function(df){
  # Check the argument itself (df), not the global df1
  if(!is.data.frame(df)) stop("Object not a data.frame")
  # Cleaned leading part of each name, followed by the untouched suffix
  new_names <- paste0(
    gsub("[_.]", "", gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", names(df))),
    gsub(".*[a-zA-Z]{2,}[0-9]*", "", names(df)))
  names(df) <- new_names
  df
}
df1 <- normalize_names(df1)
df_list <- lapply(list(df1, df2, df3), normalize_names)
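If it helps to keep track of which result belongs to which input, a named list is one option. The sketch below repeats the function so it runs on its own; the data frames `staff`, `payroll` and `dates` are hypothetical stand-ins for the three real ones.

```r
normalize_names <- function(df) {
  if (!is.data.frame(df)) stop("Object not a data.frame")
  nm <- names(df)
  names(df) <- paste0(
    gsub("[_.]", "", gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", nm)),
    gsub(".*[a-zA-Z]{2,}[0-9]*", "", nm))
  df
}

# Hypothetical stand-ins for the three real data frames.
staff   <- data.frame(all.new.Staff_D2_H4 = 1:3, all_old_D3_H13 = 1:3)
payroll <- data.frame(initial_salary_D1_H1 = 1:3)
dates   <- data.frame(Date.of.Birth = 1:3)

# Naming the list elements lets each cleaned data frame be pulled out by name.
cleaned <- lapply(list(staff = staff, payroll = payroll, dates = dates),
                  normalize_names)

names(cleaned$staff)    # "allnewStaff_D2_H4" "allold_D3_H13"

# Overwrite an original with its cleaned version if that is what is wanted.
staff <- cleaned$staff
```

lapply() always returns a list, so each element is still a full data frame; only the column names change.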