Reputation: 37
I have a list of keywords to find text in a group of PDF files. Some of the keywords must appear in combination for the text to be extracted, even if they are not adjacent in the text.
I used the pdfsearch library, and it finds text matching the individual keywords. I have read the documentation, but I cannot find a way to combine the keywords.
My code is as shown below:
library(pdftools)
library(pdfsearch)

keywords <- c("LOTE", "VOLUMEN",
              "LOTE", "SOLVENCIA",
              "LOTE", "SEGURO",
              "VOLUMEN", "TRES ÚLTIMOS",
              "VOLUMEN", "3 ÚLTIMOS",
              "VOLUMEN", "(3) ÚLTIMOS",
              "NO", "APLICA", "SOLVENCIA")

Results <- keyword_directory(directory,
                             keyword = keywords,
                             surround_lines = 1, full_names = TRUE,
                             ignore_case = TRUE, remove_hyphen = TRUE)
In the keyword assignment, every line is a combination:
"LOTE" + "VOLUMEN",
"LOTE" + "SOLVENCIA",
"LOTE" + "SEGURO",
"VOLUMEN" + "TRES ÚLTIMOS",
"VOLUMEN" + "3 ÚLTIMOS",
"VOLUMEN" + "(3) ÚLTIMOS",
"NO" + "APLICA" + "SOLVENCIA"
For example, for the combination "NO" + "APLICA" + "SOLVENCIA":
This text should be extracted: "No siempre aplica el uso de solvencia para el proyecto" ("The use of solvency does not always apply to the project").
This text should not be extracted, even though it contains the keyword "NO": "No pueden contar con las listas antes de tiempo" ("They cannot have the lists ahead of time").
At the moment I am only able to get the text where the individual keywords appear.
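For illustration, this is the behaviour I'm after, sketched in plain R (contains_all is a made-up helper for this example, not part of pdfsearch):

# TRUE only if every keyword occurs somewhere in the line (case-insensitive)
contains_all <- function(line, keywords) {
  all(vapply(keywords, grepl, logical(1), x = line, ignore.case = TRUE))
}

contains_all("No siempre aplica el uso de solvencia para el proyecto",
             c("NO", "APLICA", "SOLVENCIA"))  # TRUE  -> extract this text
contains_all("No pueden contar con las listas antes de tiempo",
             c("NO", "APLICA", "SOLVENCIA"))  # FALSE -> skip this text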
Upvotes: 1
Views: 899
Reputation: 1812
I am assuming you want all keywords in a 'group' of keywords to be present on a single line in order to extract that line. If you instead want the keywords to be present anywhere in a single file in order to extract all text from that file, let me know so I can adjust the answer.
Indeed, pdfsearch::keyword_search() searches only for individual words. Luckily it does give us a page number and a line number for each result, so we can match those and check whether all words from a single group are present in the search results on the same line.
We start by defining our keywords grouped into vectors, and loading an example file:
library(pdfsearch)
library(dplyr)

# Our list of keywords, grouped in vectors
grouped_keywords <- list(c('saturated', 'model'),
                         c('vector', 'specification'),
                         c('framework', 'inferences'),
                         c('test', 'that', 'gives', 'no', 'results'),
                         c('population', 'degree', 'types'))

# Example file supplied with `pdfsearch`, also available at https://arxiv.org/pdf/1610.00147.pdf
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
To start the search, we perform keyword_search() on a flattened version of grouped_keywords. This will yield all the results we want, but also many results we don't want (lines that contain only one or a few of the keywords in a group).
# Search for individual keywords
individual_results <- keyword_search(file,
                                     keyword = unlist(grouped_keywords), # combine our keyword list into a single 1-dimensional vector
                                     path = TRUE)

cat(nrow(individual_results), 'results for individual words\n')
head(individual_results, n = 3)
Result:
367 results for individual words
# A tibble: 3 × 5
keyword page_num line_num line_text token_text
<chr> <int> <int> <list> <list>
1 saturated 5 112 <chr [1]> <list [1]>
2 saturated 5 114 <chr [1]> <list [1]>
3 saturated 5 119 <chr [1]> <list [1]>
For each group of keywords, we look for results that have the same line number and the same page number, and that together match all keywords in the group:
combined_results <- lapply(grouped_keywords, \(keyword_group) {
  individual_results %>%
    filter(keyword %in% keyword_group) %>%
    group_by(page_num, line_num) %>%
    filter(length(unique(keyword)) == length(unique(keyword_group))) %>%
    summarise(keywords = paste(keyword_group, collapse = ' + '),
              line_text = line_text[1],
              token_text = token_text[1],
              .groups = "keep")
})
# Merge list of tibbles to a single tibble
combined_results <- do.call(rbind, combined_results)
# Output result
cat(nrow(combined_results), 'results for combined words\n')
combined_results
Result:
8 results for combined words
# A tibble: 8 × 5
# Groups: page_num, line_num [8]
page_num line_num keywords line_text token_text
<int> <int> <chr> <list> <list>
1 5 112 saturated + model <chr [1]> <list [1]>
2 5 114 saturated + model <chr [1]> <list [1]>
3 5 119 saturated + model <chr [1]> <list [1]>
4 7 184 saturated + model <chr [1]> <list [1]>
5 5 124 vector + specification <chr [1]> <list [1]>
6 2 32 framework + inferences <chr [1]> <list [1]>
7 7 168 framework + inferences <chr [1]> <list [1]>
8 7 187 population + degree + types <chr [1]> <list [1]>
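Side note: line_text and token_text are list columns, which is why they print as <chr [1]> above. To read the actual matched lines, you can flatten the column, for example:

# Flatten the list column to read the matched lines as plain character strings
unlist(combined_results$line_text)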
Edit 19 July 2022:
To get only exact matches, it's not enough to add the additional filter constraint I previously described in the comments; we also need to apply that filter rowwise:
combined_results <- lapply(grouped_keywords, \(keyword_group) {
  individual_results %>%
    rowwise() %>%
    filter(keyword %in% keyword_group, tolower(keyword) %in% tolower(unlist(token_text))) %>%
    group_by(page_num, line_num) %>% # results from keyword_search() have no pdf_name column
    filter(length(unique(keyword)) == length(unique(keyword_group))) %>%
    summarise(keywords = paste(keyword_group, collapse = ' + '),
              line_text = line_text[1],
              token_text = token_text[1],
              .groups = "keep")
})
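Why rowwise()? token_text is a list column, and without rowwise() the unlist(token_text) in the filter pools the tokens of all rows into one vector, so a keyword can "match" tokens from a different row. A toy illustration (made-up data, not taken from the example PDFs):

library(dplyr)

# Toy data: two result rows, each with its own token list
toy <- tibble::tibble(keyword    = c("model", "vector"),
                      token_text = list(list(c("vector", "saturated", "model")),
                                        list(c("the", "plain", "case"))))

# Without rowwise(), the tokens of ALL rows are pooled, so the "vector" row
# wrongly survives because "vector" appears in the OTHER row's tokens
toy %>% filter(tolower(keyword) %in% tolower(unlist(token_text)))

# With rowwise(), each row is checked against its own tokens only,
# and the "vector" row is correctly dropped
toy %>% rowwise() %>% filter(tolower(keyword) %in% tolower(unlist(token_text)))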
The complete code, also using keyword_directory() instead of keyword_search() and matching case-insensitively, becomes:
# Preparation -------------------------------------------------------------
library(pdfsearch)
library(dplyr)

# Our list of keywords, grouped in vectors
grouped_keywords <- list(c('individuals', 'model'),
                         c('information', 'abundance'),
                         c('individual', 'ranking'),
                         c('Test', 'That', 'Gives', 'No', 'Results'),
                         c('population', 'degree', 'types'))
grouped_keywords <- lapply(grouped_keywords, tolower)

# Directory containing a few example PDFs:
# https://arxiv.org/pdf/2207.00011.pdf
# https://arxiv.org/pdf/2207.00039.pdf
# https://arxiv.org/pdf/2207.00076.pdf
directory <- "~/Desktop/Rtemp/pdf/"

# Search for individual keywords ------------------------------------------
individual_results <- keyword_directory(directory,
                                        keyword = unlist(grouped_keywords), # combine our keyword list into a single 1-dimensional vector
                                        split_pdf = TRUE)

cat(nrow(individual_results), 'results for individual words\n')
View(individual_results)
# Merge results for keywords in the same subgroup and file ----------------
combined_results <- lapply(grouped_keywords, \(keyword_group) {
  individual_results %>%
    rowwise() %>%
    filter(keyword %in% keyword_group, tolower(keyword) %in% tolower(unlist(token_text))) %>%
    group_by(pdf_name, line_num) %>%
    filter(length(unique(keyword)) == length(unique(keyword_group))) %>%
    summarise(keywords = paste(keyword_group, collapse = ' + '),
              line_text = line_text[1],
              token_text = token_text[1],
              .groups = "keep")
})
# Merge list of tibbles to a single tibble
combined_results <- do.call(rbind, combined_results)
# Output result
cat(nrow(combined_results), 'results for combined words\n')
combined_results
Result:
# A tibble: 4 × 5
# Groups: pdf_name, line_num [4]
pdf_name line_num keywords line_text token_text
<chr> <int> <chr> <list> <list>
1 2207.00039.pdf 299 individuals + model <chr [1]> <list [1]>
2 2207.00076.pdf 4 individuals + model <chr [1]> <list [1]>
3 2207.00076.pdf 16 individuals + model <chr [1]> <list [1]>
4 2207.00039.pdf 10 information + abundance <chr [1]> <list [1]>
You'll notice I also added split_pdf = TRUE to the keyword_directory() call, to improve the handling of multi-column PDFs. I have also removed matching on the page number; since line numbers run continuously through the whole document (see the page 5 through page 7 results above), matching on the line number alone is enough within each file.
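For completeness, a sketch of the grouped keyword list for the combinations from your question (transcribed directly from the pairs you listed):

grouped_keywords <- list(c('LOTE', 'VOLUMEN'),
                         c('LOTE', 'SOLVENCIA'),
                         c('LOTE', 'SEGURO'),
                         c('VOLUMEN', 'TRES ÚLTIMOS'),
                         c('VOLUMEN', '3 ÚLTIMOS'),
                         c('VOLUMEN', '(3) ÚLTIMOS'),
                         c('NO', 'APLICA', 'SOLVENCIA'))
grouped_keywords <- lapply(grouped_keywords, tolower)

One caveat to check: multi-word keywords such as 'TRES ÚLTIMOS' are found by the line search, but the exact-match token filter from my edit compares against single tokens, so I expect those groups are better served by the first version without that filter.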
Upvotes: 2