OLu
OLu

Reputation: 77

Reading multiple csv files that contains a particular text from a directory into r

I have imported a set of .csv files from a directory into a list file like below.

features <- c("ASB_aggregated_by_lsoa_2015.csv", "ASB_aggregated_by_lsoa_2019.csv", "ASB_aggregated_by_lsoa_2022.csv", "Bicycle_theft_aggregated_by_lsoa_2015.csv", "Bicycle_theft_aggregated_by_lsoa_2019.csv", "Bicycle_theft_aggregated_by_lsoa_2022.csv", "Burglary_aggregated_by_lsoa_2015.csv", "Burglary_aggregated_by_lsoa_2019.csv", "Burglary_aggregated_by_lsoa_2022.csv", "Criminal_damage_and_arson_aggregated_by_lsoa_2015.csv", "Criminal_damage_and_arson_aggregated_by_lsoa_2019.csv", "Criminal_damage_and_arson_aggregated_by_lsoa_2022.csv", "Drugs_aggregated_by_lsoa_2015.csv", "Drugs_aggregated_by_lsoa_2019.csv", "Drugs_aggregated_by_lsoa_2022.csv", "Other_crime_aggregated_by_lsoa_2015.csv", "Other_crime_aggregated_by_lsoa_2019.csv", "Other_crime_aggregated_by_lsoa_2022.csv", "Other_theft_aggregated_by_lsoa_2015.csv", "Other_theft_aggregated_by_lsoa_2019.csv", "Other_theft_aggregated_by_lsoa_2022.csv")

The files for each year have a similar column, but the number of cases in each file are different. First, I would like to combine all the files for similar years (2015; 2019 and 2022) into the same dataframe and do a full join based on their similar column. At the end, I would have 3 dataframes for each year.

Thank you

Upvotes: 0

Views: 46

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270428

First set up some test files and create features. Now read the files into DFs and create a years vector that gives the corresponding year for each entry in DFs splitting DFs by years giving spl. Then for each component of the split perform the merge.

# set up test files
for(year in c(2015, 2019, 2022))
  for(let in letters[1:3])
    write.csv(BOD + year + match(let, letters), paste0("BOD-", year, let, ".csv"),
      row.names = FALSE, quote = FALSE)
features <- dir()

# we now have features and the test files so process them
DFs <- Map(read.csv, features)
years <- gsub("\\D", "", features)
spl <- split(DFs, years)
L <- lapply(spl, \(x) Reduce(function(x, y) merge(x, y, by="Time", all=TRUE), x))

str(L) # show structure of result

giving:

List of 3
 $ 2015:'data.frame':   9 obs. of  4 variables:
  ..$ Time    : int [1:9] 2017 2018 2019 2020 2021 2022 2023 2024 2025
  ..$ demand.x: num [1:9] 2024 2026 2035 2032 2032 ...
  ..$ demand.y: num [1:9] NA 2025 2027 2036 2033 ...
  ..$ demand  : num [1:9] NA NA 2026 2028 2037 ...
 $ 2019:'data.frame':   9 obs. of  4 variables:
  ..$ Time    : int [1:9] 2021 2022 2023 2024 2025 2026 2027 2028 2029
  ..$ demand.x: num [1:9] 2028 2030 2039 2036 2036 ...
  ..$ demand.y: num [1:9] NA 2029 2031 2040 2037 ...
  ..$ demand  : num [1:9] NA NA 2030 2032 2041 ...
 $ 2022:'data.frame':   9 obs. of  4 variables:
  ..$ Time    : int [1:9] 2024 2025 2026 2027 2028 2029 2030 2031 2032
  ..$ demand.x: num [1:9] 2031 2033 2042 2039 2039 ...
  ..$ demand.y: num [1:9] NA 2032 2034 2043 2040 ...
  ..$ demand  : num [1:9] NA NA 2033 2035 2044 .

Upvotes: 0

Niklas
Niklas

Reputation: 166

You can do what you would want to do by creating a data frame with the filenames, extracting the year (e.g mutate(year=str_extract(filename,"20[0-9]+"))) and then proceeding with group_by(year) and map_dfr() to carry out the other operations on a grouped dataframe. This would give you a single data frame with years, you can subset that manually. Without the files or knowing at least their structure, it is difficult to go into more details. Also, it seems like you have "Census_combined_data" for years 2011 and 2021, but no other data for those years, that would of course break the workflow since you are missing files and have files with a different structure.

Upvotes: 0

Related Questions