Removing specific and special characters in multiple defined columns in r dataframe

Question

How to remove specific and special characters in multiple defined columns in r dataframe?

I have a dataset which needs some serious cleaning, similar to the example data below. My end goal is to have a dataframe in which the column types match the column name (e.g., date = date, times = time, numeric1 = numeric, char = character) and in which all of the special characters or values which would otherwise prevent conversion into the correct type are removed. The real dataset contains over 60 columns, about half of which need removal of multiple characters, so I would like to find a way to select the appropriate columns to make multiple modifications with as little code repetition as possible.

Please see example data inputs and desired outputs below. As requested, this image shows the desired output dataframe appearance.

Any help would be appreciated.

#Example Data Input from nested list
my_nested_list <- list(date=c('8/16/2010', '8/17/2010', '8/18/201', '8/19/2010', '8/20/2010'),
                       times=c('8:45', '14:20', '13:00', '15:20', '9:05'),
                       numeric1=c('>3.0','> 3.0','1.2m','.8','?1.8'),
                       character1=c('Epi', '', 'Hypo', 'Epi', ''),
                       character2=c('GC1238', '', 'GC1239', '', 'GC1240'),
                       numeric2=c('N/A', '-', '8.9', '', '2.2'),
                       numeric3=c('R', '12.4', '4.1', '', '11.58'))

#Convert nested list to the dataframe by columns
df <- as.data.frame(do.call(cbind, my_nested_list))
df
sapply(df, class) #displays all column types as characters

I know you can use either str_replace_all() or gsub() to replace the special characters in one column or all columns, but I would like to do so in specified columns (e.g., df$numeric1, df$numeric2, df$numeric3 OR df[3], df[5:6]). As in the example data, I also have non-special characters which need to be removed only from the numeric columns (e.g., "R", ">", "> ", "m", "-", "N/A")

This is how I have been replacing the special characters and each individual character, one column at a time, but I am open to ideas. I have also tried doing multiple columns, but it isn't working.

#One column at a time
##Using str_replace_all() in one specified column 
df$numeric1 <- str_replace_all(df$numeric1, "$$\[:punct:$$\]", " ")
df$numeric1 <- str_replace_all(df$numeric1, ""R", " ")

##Using base gsub() in one specified column
gsub("$$\[:punct:$$\]", " ", df$numeric1)
gsub("R", " ", df$numeric1)

#One character in multiple columns with resulting error: "Error in `[.default`(df, cols_nan) : invalid subscript type 'list'"
cols_nan <- c(df[5:6])
df[cols_nan] <- gsub('[^[:alnum:] ]','',df[cols_nan])

I have tried a few ways to change the date and time column types (per previous stackoverflow posts). While they are changing the column type, they are returning no values in the column.

#dates
df$date <- lubridate::mdy(df$date)
df$date <- as.Date(df$date, format= %Y-%m-%d")
df$date <- lubridate::mdy(df$date)

#time
df[['times']] <- strptime(df[['times']], format = "%H:%M:%S")

When making the modifications one character and one column at a time, the column types change using the following.

df <-data.frame(df,  stringsAsFactors=FALSE)
sapply(df, class) #I would like to display all column types as designated in headers

Removing specific and special characters in multiple defined columns in r dataframe

Answers (1)

Related Questions