Extract only bullet points from PDF using R or Python

Question

I have a fairly simple document (the governmental strategies for the mandate) with titles, normal text and then bullet points (which are the strategies they are looking to implement).

I can read and extract pages or titles from the .pdf and convert it to a .txt, but I'd like to keep only the text (the whole paragraphs) inside bullet points, which is what I'm interested in. I reckon there's some way to do this, as they can probably be identified by the bullet point itself.

Is there a simple enough way to do this in R and/or Python? I'm not familiar with other programming languages or parsing methods.

EDIT: I quickly converted some basic text from one page to HTML (using https://wordtohtml.net), and it seems to turn bullet points into <li> elements, which I'm guessing would be easy enough to parse. Is there a quick and easy way to convert the whole 262-page document to HTML while keeping the <li> format, preferably in R/Python? Or do you know of a way to do this directly on the PDF, which would be preferable as it would be at least one less step?


Answers (1)
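
One possible approach, sketched here under assumptions rather than tested on the actual document: since the PDF can already be converted to a .txt, the bullet paragraphs can be picked out of that text directly, without going through HTML. The sketch below assumes that each bullet starts with a bullet glyph at the beginning of a line, that wrapped lines of the same bullet follow immediately, and that a blank line (or the next bullet) ends the paragraph. The function name `extract_bullets` and the set of glyphs in `BULLET_RE` are illustrative choices, not something taken from the document, so check what the markers actually look like in your .txt first.

```python
import re

# Glyphs that often survive PDF-to-text conversion as bullet markers.
# This set is an assumption; inspect your own .txt and adjust it.
# \uf0b7 is a private-use character that Word-generated PDFs sometimes emit.
BULLET_RE = re.compile(r"^\s*[•◦▪‣∙·\uf0b7]\s*(.*)$")


def extract_bullets(text):
    """Return the full text of each bulleted paragraph found in `text`."""
    bullets = []
    current = None  # lines of the bullet being collected, or None

    for line in text.splitlines():
        match = BULLET_RE.match(line)
        if match:
            # A new marker closes the previous bullet and opens a new one.
            if current is not None:
                bullets.append(" ".join(current))
            current = [match.group(1).strip()]
        elif not line.strip():
            # A blank line is assumed to end the current bullet paragraph.
            if current is not None:
                bullets.append(" ".join(current))
                current = None
        elif current is not None:
            # Wrapped continuation line of the current bullet.
            current.append(line.strip())

    if current is not None:
        bullets.append(" ".join(current))
    return bullets


sample = """Strategic priorities

The government will focus on the following.

• Expand access to primary care
  in rural regions.
• Reduce waiting times.

Closing remarks."""

for item in extract_bullets(sample):
    print(item)
```

To run it on the real document, read the converted file into a string (for example `open("strategies.txt", encoding="utf-8").read()`, where the filename is hypothetical) and pass it to `extract_bullets`. If the converter does not leave a blank line after the last bullet of a list, the normal text that follows will be glued onto that bullet, so the stopping rule may need adjusting for your file.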
