Inconsistent variable names pattern matching in R

Question

I've got a data set with a large number of inconsistently named variables. The current naming is captured in df1, and the desired names in df2. I'd like a variable name such as initial_salary_D1_H1 to be changed to initialsalary_D1_H1; in other words, any character that is not a letter or digit (such as _ or .) before the _D1_H1 substring should be removed.

The _D1_H1 pattern, i.e. underscore + any letter + any number, appears at the end of most variable names, although the number of times it appears varies. It may appear zero times (Date.of.Birth), twice (Employee_Date_Start_D1_H1), or three times (Employee_Gender_g2_D1_H1).

There's the added difficulty that the pattern features towards the beginning of some variable names (lunch_p5_type_canteen1_D1_H1).

Finally, in some variable names the last character before the pattern is a letter (all.new.Staff_D2_H4), and in others it's a digit (dept_Preferences2_D1_H8).

# Current names
df1 <- data.frame(
  Date.of.Birth = c(1,1,1),
  initial_salary_D1_H1 = c(1,1,1),
  Employee_Gender_g2_D1_H1 = c(1,1,1),
  Employee_Date_Start_D1_H1 = c(1,1,1),
  dept_Preferences2_D1_H8 = c(1,1,1),
  ini_InterviewShortlist2_D1_H6 = c(1,1,1),
  Retentionpercentage_D7_H19 = c(1,1,1),
  all.new.Staff_D2_H4 = c(1,1,1),
  all_old_D3_H13 = c(1,1,1),
  lunch_p5_type.canteen_D1_H1 = c(1,1,1),
  all_EmploymentStatus4_D2_H11 = c(1,1,1))

# Desired names
df2 <- data.frame(
  DateofBirth = c(1,1,1),
  initialsalary_D1_H1 = c(1,1,1),
  EmployeeGender_g2_D1_H1 = c(1,1,1),
  EmployeeDateStart_D1_H1 = c(1,1,1),
  deptPreferences2_D1_H8 = c(1,1,1),
  iniInterviewShortlist2_D1_H6 = c(1,1,1),
  Retentionpercentage_D7_H19 = c(1,1,1),
  allnewStaff_D2_H4 = c(1,1,1),
  allold_D3_H13 = c(1,1,1),
  lunchp5typecanteen_D1_H1 = c(1,1,1),
  allEmploymentStatus4_D2_H11 = c(1,1,1))

names(df1)
names(df2)

Below is what I've been working on, but I'm completely new to pattern matching. Is using substring() along with gsub() an option?

substring(names(df1), regexpr("*_\\w\\d$i*", names(df1)) - 1)

Thanks for any help!

UPDATE:

How do I stick it in a function and save the output as data frames (I've got three data frames)?

dimitris_ps · Accepted Answer

There is probably a shorter way to do this, but I think this will do the job:

paste0(
    gsub("[_.]", "", gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", names(df1))),
    gsub(".*[a-zA-Z]{2,}[0-9]*", "", names(df1)))

Explanation

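The inner call, gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", names(df1)), keeps everything up to and including the last run of at least two letters, plus any digits that directly follow it. Let's call this X. Note that the backreference must be written "\\1" inside an R string.

The outer call, gsub("[_.]", "", X), removes every _ and . from X.

The last part, gsub(".*[a-zA-Z]{2,}[0-9]*", "", names(df1)), deletes X from the original name, leaving only the trailing suffix (for example _D1_H1).

paste0() then joins the cleaned X and the suffix back together.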
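To make these steps concrete, here is a minimal sketch that traces them on three of the names from df1. The variable names x, prefix, prefix_clean, and suffix are only for illustration and are not part of the original answer.

```r
# Three names from df1, used only to trace the steps.
x <- c("Date.of.Birth", "initial_salary_D1_H1", "dept_Preferences2_D1_H8")

# Keep everything up to the last run of 2+ letters (plus trailing digits).
prefix <- gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", x)
# "Date.of.Birth" "initial_salary" "dept_Preferences2"

# Drop "_" and "." from that prefix.
prefix_clean <- gsub("[_.]", "", prefix)

# Delete the prefix from the original name, leaving only the suffix.
suffix <- gsub(".*[a-zA-Z]{2,}[0-9]*", "", x)
# "" "_D1_H1" "_D1_H8"

# Glue the cleaned prefix and the untouched suffix back together.
result <- paste0(prefix_clean, suffix)
result
# "DateofBirth" "initialsalary_D1_H1" "deptPreferences2_D1_H8"
```

Note that this relies on the suffix tokens (D1, H8, g2, ...) containing only a single letter each. A token with two or more consecutive letters, such as a hypothetical _AB1, would be treated as part of the prefix and lose its underscore.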

Update

To use it in a function and save the output as a data frame:

normalize_names <- function(df){

  # Check the argument itself (df), not the global df1
  if(!is.data.frame(df)) stop("Object not a data.frame")

  # Cleaned prefix (separators removed) + untouched suffix
  new_names <- paste0(
    gsub("[_.]", "", gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", names(df))),
    gsub(".*[a-zA-Z]{2,}[0-9]*", "", names(df)))

  names(df) <- new_names
  df

}

Example

df1 <- normalize_names(df1)
df_list <- lapply(list(df1, df2, df3), normalize_names)
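If the data frames should stay identifiable after the call, one option is to pass a named list to lapply(), which keeps the list names on its result. The following is a minimal sketch, assuming a shortened df1 and two made-up data frames, df_a and df_b, standing in for the other two; the function is repeated here only so the block runs on its own.

```r
normalize_names <- function(df) {
  if (!is.data.frame(df)) stop("Object not a data.frame")
  x <- names(df)
  names(df) <- paste0(
    gsub("[_.]", "", gsub("(.*[a-zA-Z]{2,}[0-9]*).*", "\\1", x)),
    gsub(".*[a-zA-Z]{2,}[0-9]*", "", x))
  df
}

# Hypothetical stand-ins for the three data frames.
df1  <- data.frame(Date.of.Birth = 1, initial_salary_D1_H1 = 1)
df_a <- data.frame(all.new.Staff_D2_H4 = 1)
df_b <- data.frame(Employee_Gender_g2_D1_H1 = 1)

# lapply() keeps the list names, so each result can be pulled out by name.
cleaned <- lapply(list(df1 = df1, df_a = df_a, df_b = df_b), normalize_names)

names(cleaned$df1)   # "DateofBirth" "initialsalary_D1_H1"
names(cleaned$df_a)  # "allnewStaff_D2_H4"
names(cleaned$df_b)  # "EmployeeGender_g2_D1_H1"
```

Each element of cleaned is still a data frame, so it can be assigned back individually, for example df1 <- cleaned$df1.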
