Jose David

Reputation: 11

How to extract text in bullet points

I'm new to programming and I have to admit it's been a bit difficult.

I'm trying to extract information from several PDF files. I already have all the information loaded in R, and what I need to extract now is in bullet points.

Example:

Each PDF document contains information like this, but the heading is different. The only thing they have in common is that the information is in bullet points:

…This research contains the following information:
•   Red flags detected
•   Specific glossary
•   Documents provided by other units
According to the studies…

I only need to extract: Red flags detected. Specific glossary. Documents provided by other units.

My idea is to create a data frame that contains a summary of the information from each document.
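To illustrate the shape I have in mind, here is a minimal sketch of the target data frame. The file names, the second document's bullets, and the `bullets_by_file` list are hypothetical placeholders; it assumes the bullet text has already been pulled out of each document.

```r
# Hypothetical input: bullet text already extracted, one character vector per file.
# File names and the second document's bullets are made up for illustration.
bullets_by_file <- list(
  "report_a.pdf" = c("Red flags detected",
                     "Specific glossary",
                     "Documents provided by other units"),
  "report_b.pdf" = c("Scope of the review",
                     "Open issues")
)

# One row per document, with its bullets collapsed into a single summary string.
summary_df <- data.frame(
  file = names(bullets_by_file),
  bullets = vapply(bullets_by_file, paste, character(1),
                   collapse = ". ", USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(summary_df)
```

Collapsing into one string per file is only one option; keeping one row per bullet would also work if the bullets need to be filtered later.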

My code is:

library(dplyr)
library(stringr)
library(pdftools)
library(tm)

setwd("C:/documents/R_Studio") # working directory
list.files()
# vector of PDF files
files <- list.files(pattern = "\\.pdf$")

ci <- lapply(files, pdf_text) # load text from all the files
length(ci) # verify the number of files
lapply(ci, length) # verify the page count of each file
str_extract_all(ci) # This is where I'm having problems: I only need to extract the information in bullets
test <- unlist(str_split(ci, "([;:])\\\n\n")) # I've also tried this (test 2), but I don't know how to extract the info.

The result is the following:

\\n\\nThis research contains the following information:\\n\\n   <U+25CF>   Red flags detected\\n   <U+25CF>   Specific glossary\\n   <U+25CF>   Documents provided by other units"\n\nAccording to the studies...
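One direction that might work on text like this is to split each page into lines and keep only the lines that begin with a bullet glyph. The sketch below is not a confirmed solution: `extract_bullets` is a hypothetical helper, the sample `page` string is modelled on the output above, and it assumes the bullets come through as `•` (U+2022) or `●` (U+25CF). Other PDFs may use different glyphs.

```r
library(stringr)

# Sample page text modelled on the output above (an assumption: real
# pdf_text() output may use other bullet glyphs or spacing).
page <- paste0(
  "\n\nThis research contains the following information:\n\n",
  "   \u25cf   Red flags detected\n",
  "   \u25cf   Specific glossary\n",
  "   \u25cf   Documents provided by other units\n\n",
  "According to the studies..."
)

# Hypothetical helper: return the text of every line that starts with a bullet.
extract_bullets <- function(text) {
  # Split the text into individual lines.
  lines <- unlist(str_split(text, "\r?\n"))
  # Keep lines whose first non-space character is a bullet glyph.
  bullet_lines <- str_subset(lines, "^\\s*[\u2022\u25cf]")
  # Drop the glyph and the whitespace around it.
  str_trim(str_remove(bullet_lines, "^\\s*[\u2022\u25cf]\\s*"))
}

bullets <- extract_bullets(page)
print(bullets)
```

Since `pdf_text` returns one string per page, the helper could be applied to each file with something like `lapply(ci, extract_bullets)`, which passes all pages of a file at once and returns their bullets as a single vector.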

Upvotes: 1

Views: 366

Answers (0)
