Reputation: 11
I'm new to programming, and I have to admit it's proving a bit difficult.
I'm trying to extract information from several PDF files. I already have all the information loaded in R, and what I need to extract now is in bullet points.
Example:
Each PDF document contains information like the following, but the heading differs from file to file. The only thing they have in common is that the information is in bullets:
…This research contains the following information:
• Red flags detected
• Specific glossary
• Documents provided by other units
According to the studies…
I only need to extract: Red flags detected. Specific glossary. Documents provided by other units.
My idea is to create a data frame that contains a summary of the information from each document.
My code is:
library(dplyr)
library(stringr)
library(pdftools)
library(tm)
# set the working directory (route to the PDF files)
setwd("C:/documents/R_Studio")
list.files()
# vector of PDF file names (the dot is escaped so only a literal ".pdf" matches)
files <- list.files(pattern = "\\.pdf$")
# load the text from all the files (pdf_text returns one string per page)
ci <- lapply(files, pdf_text)
# verify the number of files
length(ci)
# verify the number of pages in each file
lapply(ci, length)
# This is the part where I'm having problems. I only need to extract the information in bullets.
str_extract_all(ci)
# I've tried with test 2, but I don't know how to extract the info.
test <- unlist(str_split(ci, "([;:])\\\n\n"))
The result is the following:
"\n\nThis research contains the following information:\n\n <U+25CF> Red flags detected\n <U+25CF> Specific glossary\n <U+25CF> Documents provided by other units\n\nAccording to the studies..."
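For reference, here is a minimal sketch of the kind of result I'm after. It is not a tested solution for my real files. It assumes every bullet sits on its own line and starts with U+25CF (as in the output above) or U+2022. The `page` string, the `extract_bullets` helper and the `doc1.pdf` name are hypothetical stand-ins; with the real data the input would be the `ci` list named by `files`.

```r
library(stringr)

# Hypothetical stand-in for the text of one page as returned by pdf_text().
page <- paste0(
  "\n\nThis research contains the following information:\n\n",
  " \u25CF Red flags detected\n",
  " \u25CF Specific glossary\n",
  " \u25CF Documents provided by other units\n\n",
  "According to the studies..."
)

# Assumption: a bullet line is optional whitespace, a bullet character
# (U+25CF or U+2022), then the text to keep.
extract_bullets <- function(text) {
  # Split every page into lines and flatten them into one character vector.
  lines <- unlist(str_split(text, "\n"))
  # Column 2 of str_match() holds the capture group; it is NA for non-bullet lines.
  hits <- str_match(lines, "^\\s*[\\u25CF\\u2022]\\s*(.+?)\\s*$")[, 2]
  hits[!is.na(hits)]
}

# Hypothetical named list; with real data this could be setNames(ci, files).
docs <- list("doc1.pdf" = page)

# One row per document, with the bullets collapsed into one summary string.
summary_df <- data.frame(
  file = names(docs),
  bullets = vapply(
    docs,
    function(p) paste(extract_bullets(p), collapse = ". "),
    character(1)
  ),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(summary_df)
```

On the sample text, `extract_bullets(page)` gives the three items "Red flags detected", "Specific glossary" and "Documents provided by other units". Matching line by line avoids depending on the heading, which is different in each document.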
Upvotes: 1
Views: 366