Charlie Mintz

Reputation: 79

Trying to build vector of specific files within folders (R)

Edit: I think I solved this by using the getAbsolutePath function in the R.utils package.

wage_data_files <- list.files("WageData", full.names=TRUE, recursive=TRUE)
all_files <- grep(paste(toMatch, collapse = "|"), wage_data_files, value=TRUE)
files_vector <- vector()
for (i in seq_along(all_files)){
files_vector <- c(files_vector, getAbsolutePath(all_files[i]))}

Thanks again for all the help.
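For comparison, a minimal sketch of the same idea in base R, assuming absolute paths are all that is needed from R.utils. The temporary directory tree and the two-state `toMatch` are made up for illustration; `normalizePath()` is vectorized, so no loop is required.

```r
# Build a small throwaway tree that stands in for WageData (hypothetical files).
root <- file.path(tempfile("wage_"), "WageData")
years <- c("sic.1980.annual.by.area", "sic.1981.annual.by.area")
for (y in years) {
  dir.create(file.path(root, y), recursive = TRUE, showWarnings = FALSE)
  yr <- substr(y, 5, 8)  # "1980", "1981"
  file.create(file.path(root, y, paste0("sic.", yr, ".annual 16000 (Idaho -- Statewide).csv")))
  file.create(file.path(root, y, paste0("sic.", yr, ".annual 01000 (Alabama -- Statewide).csv")))
}

toMatch <- c("Idaho -- Statewide", "Texas -- Statewide")

# recursive = TRUE descends into every year folder; full.names = TRUE keeps the path.
all_paths <- list.files(root, full.names = TRUE, recursive = TRUE)

# Match against the file name only, so folder names cannot trigger a match.
keep <- grepl(paste(toMatch, collapse = "|"), basename(all_paths))

# normalizePath() is base R and returns absolute paths for the whole vector at once.
files_vector <- normalizePath(all_paths[keep])
```

Because every element of `files_vector` is an absolute path, it can be passed to `read.csv()` without changing the working directory.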

#

I'm trying to extract a subset of .csv files from a collection of folders. I want to collect their paths in a vector, then extract a specific value from each file and store those values in another vector. My question is about the first step: how to get all the files I want into a vector that I can then loop over to extract the values.

This is the folder structure:

Desktop -> WageData -> 21 folders (sic.1980.annual.by.area, sic.1981.annual.by.area, up to sic.2000.annual.by.area) -> roughly 1000 .csv files in each of those folders.

I'm trying to extract six of those .csv files in each folder: "Idaho -- Statewide", "Indiana -- Statewide", "Michigan -- Statewide", "Oklahoma -- Statewide", "Texas -- Statewide" and "Wisconsin -- Statewide"

So there are a total of 126 files: 6 for each year, for 21 years. Here are a couple examples of what the specific files are named:

sic.1980.annual 40000 (Oklahoma -- Statewide)

sic.1980.annual 55000 (Wisconsin -- Statewide)

Here is my code:

setwd("~/Desktop")
wage_data_files <- list.files("WageData", full.names=TRUE)
for (i in seq_along(wage_data_files)){
year_files <- list.files(wage_data_files[i])
toMatch <- c("Idaho -- Statewide", "Indiana -- Statewide", "Michigan -- Statewide", 
         "Oklahoma -- Statewide", "Texas -- Statewide", "Wisconsin -- Statewide")
dat <- data.frame()
states_vector1 <- c(dat, grep(paste(toMatch, collapse = "|"), year_files, value=TRUE))
print(states_vector1)}

One problem I'm running into right away, when I try to debug, is that I can't get my results to print out correctly. When I put the curly brackets after the print statement, I get a list like this:

[[1]]
[1] "sic.1980.annual 16000 (Idaho -- Statewide).csv"

[[2]]
[1] "sic.1980.annual 18000 (Indiana -- Statewide).csv"

[[3]]
[1] "sic.1980.annual 26000 (Michigan -- Statewide).csv"

[[4]]
[1] "sic.1980.annual 40000 (Oklahoma -- Statewide).csv"

[[5]]
[1] "sic.1980.annual 48000 (Texas -- Statewide).csv"

[[6]]
[1] "sic.1980.annual 55000 (Wisconsin -- Statewide).csv"

[[1]]
[1] "sic.1981.annual 16000 (Idaho -- Statewide).csv"

As you can see, it repeats after 6, even though wage_data_files is a vector of length 21.
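One way to read that output, sketched under the assumption that the loop body is exactly as posted: a data frame is a list, so `c(dat, ...)` returns a list with one element per file name, and `print()` runs once per folder, so the `[[1]]` numbering restarts for each year. The file names below are placeholders.

```r
dat <- data.frame()
matches <- c("a.csv", "b.csv")  # placeholder file names

# Combining an (empty) data frame with a character vector yields a list,
# which is why the output is printed as [[1]], [[2]], ...
as_list <- c(dat, matches)
class(as_list)    # "list"
length(as_list)   # 2

# Starting from an empty character vector keeps a plain character vector.
as_vector <- c(character(0), matches)
class(as_vector)  # "character"
```

Collecting into a character vector that is created once, outside the loop, avoids both the list output and the per-folder restart.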

So my first problem is getting the desired files into a vector. My second problem is how to run a for loop that reads those files in and extracts my desired value. The issue I'm running into is how to set the working directory: for the code above, the working directory is the Desktop, but to get read.csv to work with bare file names, I'd have to set the working directory to each individual folder (e.g. "WageData/sic.1980.annual.by.area", "WageData/sic.1981.annual.by.area", etc.).

Does anyone have any suggestions?

Thank you.

Upvotes: 3

Views: 1667

Answers (2)

bene

Reputation: 151

The reason it 'repeats after 6' is that you create a new data frame on every iteration of the loop, which discards whatever was collected on earlier iterations. You need to initialize the data frame (or vector) once, before the loop. Here is a possible implementation which also answers your second question:

root_directory <- "~/Desktop/WageData"
toMatch <- c("Idaho -- Statewide", "Indiana -- Statewide", "Michigan -- Statewide", 
             "Oklahoma -- Statewide", "Texas -- Statewide", "Wisconsin -- Statewide")

folders <- list.files(root_directory, full.names = TRUE)

# initialize states_vector1 as an empty vector
states_vector1 <- c()

# loop over folders and get the full path of each file matching a pattern in the toMatch vector
for (folder in folders){
  year_files <- list.files(folder)

  # get the names of matching files, e.g. "Indiana -- Statewide.csv"
  matches <- grep(paste(toMatch, collapse = "|"), year_files, value=TRUE)

  # prepend the path to the directory to get the full path to each file
  # to get e.g. "~/Desktop/WageData/sic.1980.annual.by.area/Wisconsin -- Statewide.csv"
  matches <- vapply(matches, function(x) {file.path(folder, x)}, "", USE.NAMES = FALSE)

  # append the new matches to states_vector1
  states_vector1 <- c(states_vector1, matches)
}

# now you can loop over the vector containing the full path to each file
n_files <- length(states_vector1)
extracted_values <- rep(NA, n_files)
for (i in seq_len(n_files)) {  # seq_len() is safe even when no files matched
  file_content <- read.csv(states_vector1[i])

  # create a function `extract_value()` which extracts the information you need from each file
  extracted_values[i] <- extract_value(file_content)
}

Testing this by setting up the following directory structure:

~/Desktop/WageData/sic.1980.annual.by.area/

~/Desktop/WageData/sic.1981.annual.by.area/

where each directory has all six csv files, I get the following output:

> states_vector1
 [1] "/Users/bene/Desktop/WageData/sic.1980.annual.by.area/Idaho -- Statewide.csv"    
 [2] "/Users/bene/Desktop/WageData/sic.1980.annual.by.area/Indiana -- Statewide.csv"  
 [3] "/Users/bene/Desktop/WageData/sic.1980.annual.by.area/Michigan -- Statewide.csv" 
 [4] "/Users/bene/Desktop/WageData/sic.1980.annual.by.area/Oklahoma -- Statewide.csv" 
 [5] "/Users/bene/Desktop/WageData/sic.1980.annual.by.area/Texas -- Statewide.csv"    
 [6] "/Users/bene/Desktop/WageData/sic.1980.annual.by.area/Wisconsin -- Statewide.csv"
 [7] "/Users/bene/Desktop/WageData/sic.1981.annual.by.area/Idaho -- Statewide.csv"    
 [8] "/Users/bene/Desktop/WageData/sic.1981.annual.by.area/Indiana -- Statewide.csv"  
 [9] "/Users/bene/Desktop/WageData/sic.1981.annual.by.area/Michigan -- Statewide.csv" 
[10] "/Users/bene/Desktop/WageData/sic.1981.annual.by.area/Oklahoma -- Statewide.csv" 
[11] "/Users/bene/Desktop/WageData/sic.1981.annual.by.area/Texas -- Statewide.csv"    
[12] "/Users/bene/Desktop/WageData/sic.1981.annual.by.area/Wisconsin -- Statewide.csv"
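The `extract_value()` helper in the second loop is left for the reader to define. A minimal sketch of what it might look like, assuming the value of interest is the first entry of a named column; the column name `annual_avg_wkly_wage` and the two stand-in files are hypothetical, not taken from the real wage data.

```r
# Hypothetical helper: return the first value of a named column, or NA if absent.
# The default column name is an assumption made for this example.
extract_value <- function(file_content, column = "annual_avg_wkly_wage") {
  if (!column %in% names(file_content) || nrow(file_content) == 0) {
    return(NA_real_)
  }
  as.numeric(file_content[[column]][1])
}

# Two small stand-in files with made-up numbers.
tmp_dir <- tempfile("wage_")
dir.create(tmp_dir)
paths <- file.path(tmp_dir, c("Idaho -- Statewide.csv", "Texas -- Statewide.csv"))
write.csv(data.frame(annual_avg_wkly_wage = c(250, 260)), paths[1], row.names = FALSE)
write.csv(data.frame(annual_avg_wkly_wage = c(310, 320)), paths[2], row.names = FALSE)

# Read each file by its full path and pull out one number per file.
extracted_values <- vapply(paths, function(p) extract_value(read.csv(p)),
                           numeric(1), USE.NAMES = FALSE)
```

Returning `NA_real_` for a missing column keeps the result a numeric vector of the same length as the list of files, so a problem file shows up as `NA` instead of stopping the run.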

Upvotes: 3

Rorschach

Reputation: 32466

You could try this (it is hard to test without the data). list.files can return the full path of each file, so you can pass that straight to read.csv as the file name. I converted the for loop into a couple of apply-style loops (lapply and sapply).

## Doesn't need to be in loop
toMatch <- c("Idaho -- Statewide", "Indiana -- Statewide", "Michigan -- Statewide", 
             "Oklahoma -- Statewide", "Texas -- Statewide", "Wisconsin -- Statewide")

results <- lapply(wage_data_files, function(folder) {
    year_files <- list.files(folder, full.names=TRUE)  # get full file names (w/ path)
    states_vector1 <- grep(paste(toMatch, collapse = "|"), year_files, value=TRUE)

    ## Get a value from these files
    sapply(states_vector1, function(fname) {
        read.csv(fname)[1,1]  # return the first value
    })
})

Here, the return value (stored in results) should be a list of vectors. Each element of the list would contain the results extracted from one of the year folders.
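If a single flat vector is more convenient than a list with one element per year, `unlist()` is one option. The sketch below uses a made-up stand-in for `results`; the names and numbers are illustrative only. In the real case the names would come from `sapply()`, which names its result after the file paths when it is given a character vector.

```r
# Stand-in for `results`: one named numeric vector per year folder (made-up values).
results <- list(
  c("1980/Idaho.csv" = 250, "1980/Texas.csv" = 310),
  c("1981/Idaho.csv" = 265, "1981/Texas.csv" = 330)
)

# unlist() concatenates the per-folder vectors and keeps the file names as names.
all_values <- unlist(results)

length(all_values)    # 4
names(all_values)[1]  # "1980/Idaho.csv"
```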

Upvotes: 1
