Extract variables with specific names and loop for all files

Question

The aim is to (1) extract variables (columns) with specific names (2) loop for all files in the folder and (3) save the output in csv files.

For example, the data look like for one file:

structure(list(frame = c(1, 2, 3, 4, 5), g_1 = c(0.1, 0.2, 0.1, 
0, 0), g_2 = c(0, 0, 0, 0, 0), xy_1 = c(0.4, 0.43, 0.32, 0.33, 
0.39), xy_2 = c(0, 0, 0, 0, 0.1), unit_1 = c(1, 1, 1, 1, 1), 
    unit_2 = c(0, 0, 0, 0, 0), unit_3 = c(0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-5L), variable.labels = structure(character(0), .Names = character(0)), codepage = 65001L)

For each dataset, I only need the variables (1) frame and (2) those starting with "unit_". So the output I need to get and save in csv for each file looks like:

╔═══════╦════════╦════════╦════════╗
║ frame ║ unit_1 ║ unit_2 ║ unit_3 ║
╠═══════╬════════╬════════╬════════╣
║  1.00 ║  1.00  ║   .00  ║   .00  ║
╠═══════╬════════╬════════╬════════╣
║  2.00 ║  1.00  ║   .00  ║   .00  ║
╠═══════╬════════╬════════╬════════╣
║  3.00 ║  1.00  ║   .00  ║   .00  ║
╠═══════╬════════╬════════╬════════╣
║  4.00 ║  1.00  ║   .00  ║   .00  ║
╠═══════╬════════╬════════╬════════╣
║  5.00 ║  1.00  ║   .00  ║   .00  ║
╚═══════╩════════╩════════╩════════╝

I'm wondering if there is a way to do this using dplyr or others.

I was thinking of:

files <- list.files(path="C:/files", pattern="csv")
out <- lapply(files, function(file) {
  dat <- data.frame(fread(paste0("C:/files/", file) ) ) 
  dat <- dat[c("frame", "unit_1", "unit_2", "unit_3")]})

From here, I was not clear how to save each file into csv files.

akrun · Accepted Answer

We use select with select_helpers like starts_with

library(dplyr)
df1 %>% 
     select(frame, starts_with('unit'))

-output

#   frame unit_1 unit_2 unit_3
#1     1      1      0      0
#2     2      1      0      0
#3     3      1      0      0
#4     4      1      0      0
#5     5      1      0      0

Or a regex in matches to match the pattern that starts (^) with the substring 'unit' followed by a _ and one or more digits (\d+) at the end ($) of the string as column names

df1 %>%
      select(frame, matches('^unit_\d+$'))

If it is a list of data.frame, loop over the list with map/walk, read the files with read_csv, select the columns of interest and create new files in the same directory with prefix 'new_' using write_csv

library(purrr)
library(readr)
library(stringr)
walk(files, ~ read_csv(file.path("C:/files", .x)) %>%
               select(frame, starts_with('unit')) %>%
               write_csv(file.path("C:/files", str_c("new_", .x)))
       )

The equivalent option in base R is startsWith

subset(df1, select = c('frame',
          names(df1)[startsWith(names(df1), 'unit')]))

Or with grep

subset(df1, select = grep('^(frame|unit)', names(df1), value = TRUE))

and for the list of files

lapply(files, function(x) {
      tmp <-  subset(read.csv(file.path("C:/files", x)),
              select = grep('^(frame|unit)', names(df1), value = TRUE))
    write.csv(tmp, file.path("C:/files", paste0("new_", x)), 
                 row.names = FALSE, quote = FALSE)
    })

Extract variables with specific names and loop for all files

Answers (1)

Related Questions