Reputation: 83

R Looping through multiple similar dataframes and remove partial strings from columns of each DF

I have multiple csv files namely test9 to test15. All the file has the similar structure below

data

structure(list(x = structure(c(7L, 7L, 8L, 8L, 9L, 9L, 5L, 5L, 
1L, 2L, 3L, 4L, 6L), .Label = c("CASE TEST 9 1A", "CASE TEST 9 2A", 
"CASE TEST 9 3A", "CASE TEST 9 4A", "CASE TEST 9 BOX ", "CASE TEST 9 Delta", 
"CASE TEST 9 Inlet ", "CASE TEST 9 Outlet ", "CASE TEST 9 Total "
), class = "factor"), y = structure(c(9L, 10L, 10L, 3L, 3L, 5L, 
4L, 1L, 6L, 8L, 2L, 8L, 7L), .Label = c("CASE TEST 9 1A", "CASE TEST 9 4A", 
"CASE TEST 9 BOX ", "CASE TEST 9 BOX Flow", "CASE TEST 9 BOX Total ", 
"CASE TEST 9 Gett", "CASE TEST 9 Inlet", "CASE TEST 9 Outlet", 
"CASE TEST 9 Outlet ", "CASE TEST 9 Total "), class = "factor"), 
    cdf = c(0.944604888, 0.874809846, 0.691878303, 0.524043157, 
    0.943973621, 0.687799043, 0.5105407, 0.989077927, -0.58787388, 
    0.036798229, 0.780733881, 0.116990659, 0.770803109), abscorr = c(0.944604888, 
    0.874809846, 0.691878303, 0.524043157, 0.943973621, 0.687799043, 
    0.5105407, 0.989077927, 0.58787388, 0.036798229, 0.780733881, 
    0.116990659, 0.770803109)), .Names = c("x", "y", "cdf", "abscorr"
), class = "data.frame", row.names = c(NA, -13L))

I like to loop through each files and remove the text "CASE TEST 'z'" in column x and y. I understand I can use gsub to remove them for each dataframe, but if I were to use a loop, how I can make sure that the string to remove will change accordingly to the file name.

test9: remove "CASE TEST 9"
test10: remove "CASE TEST 10"
etc etc.

Output should look something like this for each dataframes:

Appreciate all your help.

Upvotes: 1

Answers (3)

Martin Gal

Reputation: 16978

How would my workflow on this look like?

Create a list of .csv files to be read in. If there are multiple files I prefer storing them in a list instead of multiple, single data.frames. A list is easier to handle and we can avoid get-assign-structures in an automated/loop workflow.

# set work directory if necessary
# setwd("C:/testfiles")

files <- list.files(pattern = "csv$")

Read in those .csv files. I prefer using readr but base Rs read.csv is okay.

# readr version
# test_files <- lapply(files, function(x) readr::read_csv2(x))
test_files <- lapply(files, function(x) read.csv2(x))

Name the list.

names(test_files) <- gsub("\\.csv", "", files)

Clean up the data.frames.

# base R version
result <- Map(
  function(file) {
    transform(file, 
              x = gsub("(^\\s*|CASE TEST \\d+\\s*|\\s*$)", "", x),
              y = gsub("(^\\s*|CASE TEST \\d+\\s*|\\s*$)", "", y)
              )
    },
  test_files
  )

# tidyverse version
library(dplyr)
library(stringr)
library(purrr)

result <- test_files %>%
  map(
    ~.x %>%
      mutate(
        across(c(x, y), ~str_trim(str_remove(.x, "CASE TEST \\d+\\s*")))
      )
  )

Access each data.frame in the list.

# test9
result[["test9"]]
# or
result$test9


# test10
result[["test10"]]
# or
result$test10

Upvotes: 2

Anoushiravan R

Reputation: 21908

I think this could also work fine for your purpose. I noticed there are some trailing white spaces in your values in variable x and y so the last line of code just makes sure they are removed.

library(dplyr)

df %>% 
  mutate(across(c(x, y), ~ 
    gsub("CASE\\sTEST\\s\\d+\\s(.*)", "\\1", .x, perl = TRUE))) %>%
  mutate(across(c(x, y), ~ gsub("(.*)\\s$?", "\\1", .x)))

        x         y         cdf    abscorr
1   Inlet    Outlet  0.94460489 0.94460489
2   Inlet     Total  0.87480985 0.87480985
3  Outlet     Total  0.69187830 0.69187830
4  Outlet       BOX  0.52404316 0.52404316
5   Total       BOX  0.94397362 0.94397362
6   Total BOX Total  0.68779904 0.68779904
7     BOX  BOX Flow  0.51054070 0.51054070
8     BOX        1A  0.98907793 0.98907793
9      1A      Gett -0.58787388 0.58787388
10     2A    Outlet  0.03679823 0.03679823
11     3A        4A  0.78073388 0.78073388
12     4A    Outlet  0.11699066 0.11699066
13  Delta     Inlet  0.77080311 0.77080311

Upvotes: 3

DataM

Reputation: 351

I think this is what you're looking for :

my_test9 <- data.frame("x" = c("CASE TEST 9 a", "CASE TEST 9 b"),
                   "y" = c("CASE TEST 9 c", "CASE TEST 9 d"),
                   "cdf" = c(0.9, 0.1),
                   "abscorr" = c(0.9, 0.1),
                   stringsAsFactors = FALSE)

my_test10 <- data.frame("x" = c("CASE TEST 10 a", "CASE TEST 10 b"),
                    "y" = c("CASE TEST 10 c", "CASE TEST 10 d"),
                    "cdf" = c(0.9, 0.1),
                    "abscorr" = c(0.9, 0.1),
                    stringsAsFactors = FALSE)

my_test9$Number <- 9
my_test10$Number <- 10

my_list <- list(my_test9, my_test10)

my_function <- function(x) {
  my_pattern <- paste("CASE TEST ", unique(my_list[[x]]$Number), sep = "")
  my_list[[x]]$x <<- gsub(pattern = my_pattern, replacement = "", x = my_list[[x]]$x)
  my_list[[x]]$y <<- gsub(pattern = my_pattern, replacement = "", x = my_list[[x]]$y)
}
my_var <- lapply(1:length(my_list), my_function)

my_test9 <- my_list[[1]]
my_test10 <- my_list[[2]]

Upvotes: 2

R Looping through multiple similar dataframes and remove partial strings from columns of each DF

Answers (3)

Related Questions