Reputation: 1
How to remove specific and special characters in multiple defined columns in r dataframe?
I have a dataset which needs some serious cleaning, similar to the example data below. My end goal is to have a dataframe in which the column types match the column name (e.g., date = date, times = time, numeric1 = numeric, char = character) and in which all of the special characters or values which would otherwise prevent conversion into the correct type are removed. The real dataset contains over 60 columns, about half of which need removal of multiple characters, so I would like to find a way to select the appropriate columns to make multiple modifications with as little code repetition as possible.
Please see example data inputs and desired outputs below. As requested, this image shows the desired output dataframe appearance.
Any help would be appreciated.
#Example Data Input from nested list
my_nested_list <- list(date=c('8/16/2010', '8/17/2010', '8/18/201', '8/19/2010', '8/20/2010'),
times=c('8:45', '14:20', '13:00', '15:20', '9:05'),
numeric1=c('>3.0','> 3.0','1.2m','.8','?1.8'),
character1=c('Epi', '', 'Hypo', 'Epi', ''),
character2=c('GC1238', '', 'GC1239', '', 'GC1240'),
numeric2=c('N/A', '-', '8.9', '', '2.2'),
numeric3=c('R', '12.4', '4.1', '', '11.58'))
#Convert nested list to the dataframe by columns
df <- as.data.frame(do.call(cbind, my_nested_list))
df
sapply(df, class) #displays all column types as characters
I know you can use either str_replace_all() or gsub() to replace the special characters in one column or all columns, but I would like to do so in specified columns (e.g., df$numeric1, df$numeric2, df$numeric3 OR df[3], df[5:6]). As in the example data, I also have non-special characters which need to be removed only from the numeric columns (e.g., "R", ">", "> ", "m", "-", "N/A")
This is how I have been replacing the special characters and each individual character, one column at a time, but I am open to ideas. I have also tried doing multiple columns, but it isn't working.
#One column at a time
##Using str_replace_all() in one specified column
df$numeric1 <- str_replace_all(df$numeric1, "\[\[:punct:\]\]", " ")
df$numeric1 <- str_replace_all(df$numeric1, ""R", " ")
##Using base gsub() in one specified column
gsub("\[\[:punct:\]\]", " ", df$numeric1)
gsub("R", " ", df$numeric1)
#One character in multiple columns with resulting error: "Error in `[.default`(df, cols_nan) : invalid subscript type 'list'"
cols_nan <- c(df[5:6])
df[cols_nan] <- gsub('[^[:alnum:] ]','',df[cols_nan])
I have tried a few ways to change the date and time column types (per previous stackoverflow posts). While they are changing the column type, they are returning no values in the column.
#dates
df$date <- lubridate::mdy(df$date)
df$date <- as.Date(df$date, format= %Y-%m-%d")
df$date <- lubridate::mdy(df$date)
#time
df[['times']] <- strptime(df[['times']], format = "%H:%M:%S")
When making the modifications one character and one column at a time, the column types change using the following.
df <-data.frame(df, stringsAsFactors=FALSE)
sapply(df, class) #I would like to display all column types as designated in headers
Upvotes: 0
Views: 255
Reputation: 12538
Seems like you're wanting something like this:
pacman::p_load(tidyverse, hms)
df |> mutate(across(starts_with("num"), parse_number),
across(starts_with("times"), parse_hm),
across(starts_with("date"), ~ as.Date(.x, "%m/%d/%Y")),
across(starts_with("char"), ~ ifelse(.x == "", NA_character_, .x)))
Output:
# A tibble: 6 × 6
date times numeric1 char numeric2 numeric3
<date> <time> <dbl> <chr> <dbl> <dbl>
1 2010-08-16 08:45 3 Epi NA NA
2 2010-08-17 14:20 3 NA NA 12.4
3 0201-08-18 13:00 1.2 Hypo 8.9 4.1
4 2010-08-19 15:20 0.8 Epi NA NA
5 2010-08-20 09:05 1.8 NA 2.2 11.6
6 2010-08-16 08:45 3 NA NA NA
Notes:
across()
for every column type in the dataframe, under the assumption that in the larger dataframe, there's going to be multiple columns for ones which in the sample dataframe have only a single column. Also, if the columns don't have consistent names like in your example, you can use a different way of selecting columns - there's a bunch of ways, too many to list here, but you can either ask for my advice on the real data, or check the documentation.cols_nan <- c(df[5:6])
gets you the actual fifth and sixth columns from the dataframe. Then you try to index the dataframe based on the column, which fails because the column is a list. You probably wanted gsub('[^[:alnum:] ]','',df[5:6])
(but even this produces a janky result)."\[\[:punct:\]\]"
- things need to be escaped twice in regexes in R- i.e. it should be "\\[\\[:punct:\\]\\]"
Upvotes: 0