JASC

Reputation: 196

R: Count the number of files with a specific extension in different sub-directories/folders

I have a bibliographic directory (/Biblio) with 66 subdirectories (/01 folder, /02 folder, … /66 folder). Each contains a varying number of files with different extensions (e.g. pdf, txt, csv, …), plus sub-subfolders holding files with similar extensions, but I am not interested in the contents of those sub-subfolders. Some subfolders do not contain any “pdf” file. I want to count the number of “pdf” files in each subfolder.

I can list the pdf files in all subfolders of “/Biblio” with:

BiblioPath = "C:/Biblio"
BiblioDir = list.dirs(path = BiblioPath, full.names = TRUE, recursive = FALSE)
BiblioFiles = list.files(path = BiblioDir, pattern = "pdf", recursive = FALSE, full.names = TRUE) 

(Note: apart from the extension, the string “pdf” never occurs in my filenames.) “BiblioFiles” is the full list of the pdf files, but I do not know how to count how many “pdf” files are in each subdirectory without a loop.

Upvotes: 4

Views: 9664

Answers (3)

JASC

Reputation: 196

I thank @Richard Border and @alistaire for their prompt, similar, simple and elegant answers. Since they were posted as comments, I have copied the one I prefer here as an answer:

sapply(BiblioDir, function(dir) length(list.files(dir, pattern = "pdf")))

It works perfectly and I like the absence of explicit loops.
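
For readers whose filenames might contain "pdf" somewhere other than the extension, here is a slightly stricter sketch of the same idea. It assumes an anchored pattern (`"\\.pdf$"`) and uses `vapply()` so the result is guaranteed to be an integer vector; the temporary folder tree and its file names are hypothetical and exist only to make the example runnable.

```r
# Build a small throwaway tree so the example is self-contained
# (all folder and file names here are hypothetical).
root <- file.path(tempdir(), "Biblio_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "01 folder", "nested"), recursive = TRUE)
dir.create(file.path(root, "02 folder"))
file.create(file.path(root, "01 folder",
                      c("a.pdf", "b.pdf", "notes.txt", "pdf_list.csv")))
file.create(file.path(root, "01 folder", "nested", "c.pdf"))  # nested: not counted
file.create(file.path(root, "02 folder", "readme.txt"))       # folder without PDFs

dirs <- list.dirs(root, full.names = TRUE, recursive = FALSE)

# "\\.pdf$" matches only names ending in ".pdf"; ignore.case also accepts ".PDF".
pdf_counts <- vapply(
  dirs,
  function(d) length(list.files(d, pattern = "\\.pdf$", ignore.case = TRUE)),
  integer(1)
)
names(pdf_counts) <- basename(dirs)
pdf_counts
```

Because `list.files()` is not recursive by default, files in sub-subfolders are ignored, and folders without any PDF simply get a count of 0.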

Upvotes: 3

hrbrmstr

Reputation: 78832

tidyverse:

library(tidyverse)

fils <- list.files("~/Development", pattern="pdf$", full.names = TRUE, recursive = TRUE)

tibble(   # tibble() replaces the deprecated data_frame()
  dir = dirname(fils)
) %>% 
  count(dir) %>% 
  mutate(dir = map_chr(dir, digest::digest)) # hashes my dir names for privacy; drop this line in your own code

## # A tibble: 14 x 2
##                                 dir     n
##                               <chr> <int>
##  1 06e6c4fed6e941d00c04cae3bd24888b    18
##  2 98bf27d6686a52772cb642a136473d86     9
##  3 c07bfc45ce148933269d7913e1c5e833     1
##  4 84088c9c18b0eb10478f17870886b481     1
##  5 baeb85661aad8bff2f2b52cb55f14ede     1
##  6 c484306deae0a70b46854ede3e6b317a    22
##  7 70750a506855c6c6e09f8bdff32550f8     4
##  8 8c5cbe2598f1f24f1549aaafd77b14c9     1
##  9 9008083601c1a75def1d1418d8acf39e     1
## 10 0c25ef8d27250f211d56eff8641f8beb     1
## 11 3e30987a34a74cb6846abc51e48e7f9e     1
## 12 e71c330b185bf4974d26d5379793671b     1
## 13 fe2e8912e58ba889cf7c6c3ec565b2ee     4
## 14 e07698c59f5c11ac61e927e91c2e8493    27

base:

fils <- list.files("~/Development", pattern="pdf$", full.names = TRUE, recursive = TRUE)
dirs <- dirname(fils)
dirs <- sapply(dirs, digest::digest) # hashes my dir names for privacy; drop this line in your own code
as.data.frame(table(dirs))
##                                dirs Freq
## 1  06e6c4fed6e941d00c04cae3bd24888b   18
## 2  0c25ef8d27250f211d56eff8641f8beb    1
## 3  3e30987a34a74cb6846abc51e48e7f9e    1
## 4  70750a506855c6c6e09f8bdff32550f8    4
## 5  84088c9c18b0eb10478f17870886b481    1
## 6  8c5cbe2598f1f24f1549aaafd77b14c9    1
## 7  9008083601c1a75def1d1418d8acf39e    1
## 8  98bf27d6686a52772cb642a136473d86    9
## 9  baeb85661aad8bff2f2b52cb55f14ede    1
## 10 c07bfc45ce148933269d7913e1c5e833    1
## 11 c484306deae0a70b46854ede3e6b317a   22
## 12 e07698c59f5c11ac61e927e91c2e8493   27
## 13 e71c330b185bf4974d26d5379793671b    1
## 14 fe2e8912e58ba889cf7c6c3ec565b2ee    4
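
One caveat with both variants: with `recursive = TRUE`, PDFs in nested sub-subfolders are counted under their own directory, and folders without any PDF do not appear in the result at all. Below is a minimal base R sketch, assuming only first-level folders matter and zero counts should be kept. The temporary demo tree and its names are hypothetical.

```r
# Hypothetical demo tree: one folder with two PDFs (plus a nested one), one empty folder.
root <- file.path(tempdir(), "Biblio_demo_table")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "01 folder", "nested"), recursive = TRUE)
dir.create(file.path(root, "02 folder"))
file.create(file.path(root, "01 folder", c("a.pdf", "b.pdf")))
file.create(file.path(root, "01 folder", "nested", "c.pdf"))

dirs <- list.dirs(root, full.names = TRUE, recursive = FALSE)

# list.files() accepts a vector of paths, so a single non-recursive call
# covers every first-level folder and skips anything nested deeper.
fils <- list.files(dirs, pattern = "\\.pdf$", full.names = TRUE, recursive = FALSE)

# Using the folder names as factor levels keeps folders with zero PDFs in the table.
counts <- as.data.frame(
  table(dir = factor(basename(dirname(fils)), levels = basename(dirs)))
)
counts
```

The factor levels are what make the empty folder show up with `Freq` equal to 0; a plain `table(dirname(fils))` would drop it.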

Upvotes: 6

vaettchen

Reputation: 7659

Since you want to count the number of PDF files only, you don't need the file names here, so the third line of your attempted code is unnecessary.

Start with the first two lines

BiblioPath = "C:/Biblio"
BiblioDir = list.dirs(path = BiblioPath, full.names = TRUE, recursive = FALSE)

and then create a dataframe that takes the names of the folders and the PDF counts, such as

x <- data.frame( Dir = BiblioDir, no = 0 )

and update the column with the number of files, calculated via

# seq_along() is safe even if BiblioDir is empty, unlike seq(length(...))
for (i in seq_along(BiblioDir)) {
  x$no[i] <- length(list.files(path = BiblioDir[i], pattern = "pdf", recursive = FALSE))
}

That will give you a data.frame x with the folder names and the number of PDF files per folder.

This is a loop; I am not sure whether "without a loop" in your question was a hard requirement, but I don't see any reason not to use a loop here.
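
If a loop-free version of the same data.frame is preferred, it can probably be built in one step. This is only a sketch: it assumes an anchored `"\\.pdf$"` pattern instead of the plain `"pdf"` used above, and the temporary folders stand in for the real `C:/Biblio` tree.

```r
# Hypothetical stand-in for the real Biblio folder.
root <- file.path(tempdir(), "Biblio_demo_df")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "01 folder"), recursive = TRUE)
dir.create(file.path(root, "02 folder"))
file.create(file.path(root, "01 folder", c("a.pdf", "b.txt")))

BiblioDir <- list.dirs(path = root, full.names = TRUE, recursive = FALSE)

# lapply() lists the matching files per folder; lengths() turns that into counts.
x <- data.frame(
  Dir = basename(BiblioDir),
  no  = lengths(lapply(BiblioDir, list.files, pattern = "\\.pdf$")),
  stringsAsFactors = FALSE
)
x
```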

Upvotes: 1
