Reputation: 3
I am trying to split some columns in a dataframe by "." then rename the splited columns based on the original names. Original dataset Expect result
library(ISLR)
wage <- Wage #sample dataset from ISLR
wage_term_ref <- as.data.frame(wage[,3:9]) # These are the columns I need to split
colnames(wage_term_ref)
"maritl" "race" "education" "region" "jobclass" "health" "health_ins"
wage_term_ref[] <- lapply(wage_term_ref, as.character) # change all from factor to character
martil<- data.frame(do.call(rbind, strsplit(wage_term_ref$maritl, "[.]" ))) # split the first columne
names(martil)<-c("martil_Index","martil_Status") # rename the splited columns based on the original name "martil"
Then I need to repeat the same to the balance 6 columns in wage_term_ref. Finally combine all _Index columns (eg.martil_Index) & wage[,1:2] to a new dataframe "wage_updated"
Would anyone have a better way to do this? Maybe a loop? Thanks in advance.
Upvotes: 0
Views: 229
Reputation: 5788
Base R solution:
#install.packages("ISLR", dependencies = TRUE)
library(ISLR)
# Import dataset from ISLR library, store copy: wage => data.frame
wage <- Wage #sample dataset from ISLR
# Boolean vector of vectors to split: split_col => vector of booleans
split_col <- sapply(wage, function(x){if(!(is.numeric(x))){any(grepl("[.]", x))}else{FALSE}})
# Create an empty list to store subsets from the wage data.frame: wage_list => list
wage_list <- vector("list", ncol(wage[,split_col]))
# Populate the list with vectors split on ".": wage_list => list
wage_list <- lapply(wage[,split_col], function(y){strsplit(as.character(y), "[.]")})
# Calculate the maximum number of vectors per list element in the wage_list:
# max_lengths_per_el => list of named numeric vectors
max_lengths_per_el <- sapply(wage_list, function(z){max(lengths(z))})
# Convert each list element to a data.frame and name appropriately:
# wage_list => list of data.frames
wage_list <- lapply(seq_along(wage_list), function(i){
setNames(data.frame(do.call("rbind", wage_list[[i]]), row.names = NULL),
paste(names(wage_list)[i], 1:max_lengths_per_el[i], sep = "_"))
}
)
# Column bind the original data.set (excluding the columns that have been split)
# with a data.frame of the column-binded lists: wage_df => data.frame
wage_df <- cbind(wage[,!split_col], do.call("cbind", wage_list))
Upvotes: 0
Reputation: 388982
If you want to split multiple columns on the same separator you could use cSplit
from splitstackshape
which simplifies this process.
splitstackshape::cSplit(wage_term_ref, names(wage_term_ref), '.')
This adds _1
, _2
to every column name automatically.
Upvotes: 1