HCAI

Reputation: 2263

How to import files from subdirectories and label them with the subdirectory name in R

I'd like to import files (of different lengths) recursively from sub-directories and put them into one data.frame, having one column with the subdirectory name and one column with the file name (minus the extension):

e.g. folder structure
IsolatedData
  00
    tap-4.out
    cl_pressure.out
  15
    tap-4.out
    cl_pressure.out

So far I have:

setwd("~/Documents/IsolatedData")
l <- list.files(pattern = ".out$",recursive = TRUE)
p <- bind_rows(lapply(1:length(l), function(i) {chars <- strsplit(l[i], "/");
cbind(data.frame(Pressure = read.table(l[i],header = FALSE,skip=2, nrow =length(readLines(l[i])))),
      Angle = chars[[1]][1], Location = chars[[1]][1])}), .id = "id")

But I get an error saying line 43 doesn't have 2 elements.

I've also seen this approach using purrr's map_df, which looks neat, but I can't get it to work: http://www.machinegurning.com/rstats/map_df/

tbl <-
  list.files(recursive=T,pattern=".out$")%>% 
  map_df(~data_frame(x=.x),.id="id")

Upvotes: 10

Views: 3259

Answers (3)

Knackiedoo

Reputation: 568

I am guessing from your program that your ".out" files consist of a single column of data? If so, you can use scan instead of read.table. I am also guessing that you want the folder name in a column called Angle, the file name (minus extension) in a column called Location, and the data in a column called Pressure. If that is correct, the following should work:

setwd("~/Documents/IsolatedData")
l <- list.files(pattern = "\\.out$", recursive = TRUE)
p <- data.frame()
for (i in seq_along(l)){
  pt <- data.frame(Angle = strsplit(l[i], "/")[[1]][1],
                   Location = sub("\\.out$", "", basename(l[i])),
                   Pressure = scan(l[i], skip=2))
  p <- rbind(p, pt)
}

I know it is unfashionable to give an answer that uses only base R, particularly one involving a loop. However, for tasks like iterating through files in a directory, IMHO it is a perfectly reasonable thing to do, not least for readability and ease of debugging. Of course, as I expect you know, growing an object with rbind in a loop (or apply, for that matter) is not a great idea if you are dealing with big data, but I suspect that is not the case here.
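If the number of files did get large, one common alternative is to build one data frame per file in a list and bind them once at the end. Here is a minimal sketch of that idea. The temporary directory and the file contents are made up purely so the example runs on its own, and it assumes each file holds a single numeric column after two header lines.

```r
# Build a throwaway directory that mimics the layout in the question
# (the file contents are invented for illustration).
root <- file.path(tempdir(), "IsolatedData_sketch")
for (angle in c("00", "15")) {
  dir.create(file.path(root, angle), recursive = TRUE, showWarnings = FALSE)
  for (loc in c("tap-4", "cl_pressure")) {
    writeLines(c("header 1", "header 2", "1.5", "2.5", "3.5"),
               file.path(root, angle, paste0(loc, ".out")))
  }
}

# Relative paths such as "00/tap-4.out"
files <- list.files(root, pattern = "\\.out$", recursive = TRUE)

# One data frame per file, collected in a list ...
pieces <- lapply(files, function(f) {
  data.frame(Angle    = dirname(f),
             Location = sub("\\.out$", "", basename(f)),
             Pressure = scan(file.path(root, f), skip = 2, quiet = TRUE),
             stringsAsFactors = FALSE)
})

# ... and bound together in a single step
p <- do.call(rbind, pieces)
str(p)
```

This keeps the per-file logic in one small function, so it is still easy to debug by calling that function on a single path.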

Upvotes: 1

camille

Reputation: 16842

Here's a workflow with the map functions from purrr within the tidyverse.

I generated a bunch of csv files to work with to mimic your file structure and some simple data. I threw in 2 lines of junk data at the beginning of each file, since you said you were trying to skip the top 2 lines.

library(tidyverse)

setwd("~/_R/SO/nested")

walk(paste0("folder", 1:3), dir.create)

list.files() %>%
    walk(function(folderpath) {
        map(1:4, function(i) {
            df <- tibble(
                x1 = sample(letters[1:3], 10, replace = T),
                x2 = rnorm(10)
            )
            dummy <- tibble(
                x1 = c("junk line 1", "junk line 2"),
                x2 = c(0)
            )
            bind_rows(dummy, df) %>%
                write_csv(sprintf("%s/file%s.out", folderpath, i))
        })
    })

That gets the following file structure:

├── folder1
|  ├── file1.out
|  ├── file2.out
|  ├── file3.out
|  └── file4.out
├── folder2
|  ├── file1.out
|  ├── file2.out
|  ├── file3.out
|  └── file4.out
└── folder3
   ├── file1.out
   ├── file2.out
   ├── file3.out
   └── file4.out

Then I used list.files(recursive = T) to get a list of the paths to these files, used str_extract to pull the folder and file name out of each path, read each csv file while skipping the dummy text, and added the folder and file names as columns of the dataframe.

Since I did this with map_dfr, I get a single tibble back, with the dataframes from each iteration bound together by row.

all_data <- list.files(recursive = T) %>%
    map_dfr(function(path) {
        # any characters from beginning of path until /
        foldername <- str_extract(path, "^.+(?=/)")
        # any characters between / and .out at end
        filename <- str_extract(path, "(?<=/).+(?=\\.out$)")

        # skip = 3 to skip over names and first 2 lines
        # could instead use col_names = c("x1", "x2")
        read_csv(path, skip = 3, col_names = F) %>%
            mutate(folder = foldername, file = filename)
    })

head(all_data)
#> # A tibble: 6 x 4
#>   X1        X2 folder  file 
#>   <chr>  <dbl> <chr>   <chr>
#> 1 b      0.858 folder1 file1
#> 2 b      0.544 folder1 file1
#> 3 a     -0.180 folder1 file1
#> 4 b      1.14  folder1 file1
#> 5 b      0.725 folder1 file1
#> 6 c      1.05  folder1 file1

Created on 2018-04-21 by the reprex package (v0.2.0).
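As a side note, the two regular expressions assume exactly one folder level. If the files could be nested more deeply, base R's path helpers might be a safer way to split each path. A small sketch, using made-up example paths (the third one is hypothetical, just to show the nested case):

```r
# Example relative paths; "a/b/file2.out" is an invented nested case
paths <- c("folder1/file1.out", "folder2/file3.out", "a/b/file2.out")

# Folder part of each path: everything before the last "/"
folders <- dirname(paths)

# File name with the directory and the extension removed
files <- tools::file_path_sans_ext(basename(paths))

data.frame(path = paths, folder = folders, file = files)
```

Inside the map_dfr call above, dirname(path) and tools::file_path_sans_ext(basename(path)) could stand in for the two str_extract lines.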

Upvotes: 8

Jack Brookes

Reputation: 3830

Can you try:

library(tidyverse)    

tbl <-
  list.files(recursive = TRUE, pattern = "\\.out$") %>%
  set_names() %>%  # name each path by itself so .id records the path, not an index
  map_dfr(read_table, skip = 2, .id = "filepath")
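One thing to be aware of: the .id column takes its labels from the names of the input. If the vector of paths is unnamed, you get position numbers rather than paths. Here is a rough sketch of the difference; `reader` is a hypothetical stand-in for read_table so that no files are needed to run it.

```r
library(purrr)

paths <- c("00/tap-4.out", "15/tap-4.out")

# Hypothetical stand-in for a real file reader: returns a one-row data frame
reader <- function(path) data.frame(n_chars = nchar(path))

# Unnamed input: the id column holds positions ("1", "2")
by_index <- map_dfr(paths, reader, .id = "filepath")

# Named input: set_names() names each element by its own value,
# so the id column holds the paths themselves
by_name <- map_dfr(set_names(paths), reader, .id = "filepath")

by_index
by_name
```

With the paths in a column, the folder and file name can then be derived from it, for example with dirname() and basename().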

Upvotes: 3
