Reputation: 85
I have a fairly simple document (the governmental strategies for the mandate) with titles, normal text and then bullet points (which are the strategies they are looking to implement).
I can read and extract pages or titles from the .pdf and convert it to a .txt, but I'd like to keep only the text (the whole paragraphs) inside bullet points, which is what I'm interested in. I reckon there's some way to do this, as they can probably be identified by the bullet point itself.
Is there a simple enough way to do this in R and/or Python? I'm not familiar with other programming languages or parsing methods.
EDIT: I quickly converted some basic text from one page to HTML (using https://wordtohtml.net) and it seems to turn bullet points into <li> elements, which I'm guessing would be easy enough to parse. Is there a quick and easy way to convert the whole 262-page document to HTML, keeping the <li> format, in R or Python? Or do you know of a way to solve this directly from the PDF? That would be preferable, as it would be at least one less step.
Upvotes: 1
Views: 3541
Reputation: 7312
Here's my general approach:
Read in a sample string
library(stringr)
string <- "passarão a estar inscritas políticas públicas que permitam:\n • Inverter a tendência de perda de
rendimento das famílias, dos trabalhadores, dos\n funcionários públicos e dos pensionistas;\n"
Split on ";\n" or ":\n"
# Match a semicolon or colon followed by a newline, i.e. split on ";\n" or ":\n".
# In an R string, "\\n" is the regex escape \n, which matches a newline character.
stringList <- unlist(str_split(string, "[;:]\\n"))
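Return the position of any string that starts with a bullet (optionally after leading whitespace):
# The sample string uses the Unicode bullet U+2022. If your extraction produces the
# Windows-1252 bullet instead, replace "\u2022" with "\u0095".
matched <- grep("^[[:space:]]*\u2022", stringList)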
Subset to strings that start with bullets:
stringList[matched]
The weak part of this solution is that it relies on each bullet being preceded by ";\n" or ":\n". If you split on "\n" alone, you lose the rest of a bullet whenever it continues onto a second line. Depending on the format of the document, you may have to adjust the regex to make sure the string is split in the right places.
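Since the question also mentions Python, here is a rough port of the same split-and-filter steps. It is only a sketch: it assumes the text has already been extracted from the PDF, that bullets are the Unicode character U+2022, and that every bullet ends in ";\n" or ":\n" as in the sample string. The names `text`, `chunks`, `bullets` and `cleaned` are just illustrative.

```python
import re

# Sample text copied from the R example (real newlines, U+2022 bullets).
text = (
    "passarão a estar inscritas políticas públicas que permitam:\n"
    " • Inverter a tendência de perda de\n"
    "rendimento das famílias, dos trabalhadores, dos\n"
    " funcionários públicos e dos pensionistas;\n"
)

# Split wherever a semicolon or colon is immediately followed by a newline.
chunks = re.split(r"[;:]\n", text)

# Keep only the chunks whose first non-space character is a bullet.
bullets = [c for c in chunks if re.match(r"\s*\u2022", c)]

# Drop the bullet mark and collapse the line breaks inside each bullet.
cleaned = [" ".join(c.replace("\u2022", "", 1).split()) for c in bullets]
print(cleaned)
```

It shares the weakness described above: a bullet that does not end in ";" or ":" right before a line break will be merged with whatever follows it.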
You could also do the initial split by bullet: stringList <- unlist(str_split(string, "\\u2022"))
(use "\\u0095" instead if your extraction produces the Windows-1252 bullet), but then you need a rule to define where the bullet ends and plain text begins.
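One possible rule, sketched in Python, is to work line by line. The rule itself is an assumption, not something taken from the document: a bullet starts on a line whose first non-space character is the bullet, and it ends at the first line finishing with ";" or ".", at a blank line, or at the next bullet. The function name `extract_bullets`, its parameters, and the trailing "Texto normal depois." line in the sample are made up for illustration.

```python
def extract_bullets(text, bullet="\u2022", terminators=(";", ".")):
    """Return bullet paragraphs, joining lines that wrap onto the next line."""
    items, current = [], None

    def flush():
        # Store the bullet collected so far (if any) and reset the state.
        nonlocal current
        if current is not None:
            items.append(" ".join(part for part in current if part))
        current = None

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(bullet):
            flush()                      # previous bullet had no terminator
            current = [line[len(bullet):].strip()]
        elif current is None:
            continue                     # plain text outside any bullet
        elif not line:
            flush()                      # a blank line ends the bullet
            continue
        else:
            current.append(line)         # wrapped continuation line
        if line.endswith(tuple(terminators)):
            flush()                      # assumed end-of-bullet punctuation
    flush()                              # text ended in the middle of a bullet
    return items


sample = (
    "passarão a estar inscritas políticas públicas que permitam:\n"
    " • Inverter a tendência de perda de\n"
    "rendimento das famílias, dos trabalhadores, dos\n"
    " funcionários públicos e dos pensionistas;\n"
    "Texto normal depois.\n"
)
result = extract_bullets(sample)
print(result)
```

The terminator rule will cut a bullet short if a wrapped line inside it happens to end in "." (an abbreviation, for instance), so the `terminators` tuple probably needs tuning against the real document.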
Upvotes: 4