Reputation: 13
I am trying to automate a task. I have a PDF file that I download from a website.
For example, the following PDF: https://www.dgssi.gov.ma/sites/default/files/vulnerabilites_affectant_plusieurs_produits_de_cisco_13.pdf
I want to collect the information from the file as a Python dictionary of the form {'bold sentence': 'the sentences after the bold sentence'}
Example: {....... , 'Solution': 'Veuillez se référer aux bulletins de sécurité de Cisco pour mettre à jours vos équipements', .....}
I already tried converting the PDF to HTML and scraping the result, but I cannot distinguish the bold headings from the rest because all the generated HTML tags look alike.
If you can suggest an approach or some code that extracts the content as a dictionary, I would be very grateful.
Any help would be appreciated, and if I need to be more specific let me know.
Upvotes: 0
Views: 2322
Reputation: 2645
Strictly speaking, a PDF has no "bold" or "italic" attribute on its text. Bold text is drawn with a bold variant of the same font family, so we can take advantage of this: look up the font name of each piece of text and check whether it contains "bold".
You could use extract_pages from pdfminer.six, iterate over every character, and check whether its font name contains "bold".
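The character-by-character idea can be sketched roughly as follows. This is only a sketch under a few assumptions: the bold headings use a font whose name contains "bold" (in any case), every bold run is treated as a dictionary key, and any text before the first bold run is ignored. The helper names `iter_chars` and `bold_sections` are made up for this example.

```python
def iter_chars(pdf_path):
    """Yield (text, fontname) for every character pdfminer.six finds."""
    # Imported lazily so bold_sections() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname
                    else:
                        # LTAnno: an inferred space or newline, no font info
                        yield obj.get_text(), None


def _clean(parts):
    """Join characters and collapse runs of whitespace."""
    return " ".join("".join(parts).split())


def bold_sections(chars):
    """Map each run of bold text to the non-bold text that follows it."""
    sections = {}
    heading, body = [], []
    in_bold = False
    for text, fontname in chars:
        if fontname is None:
            bold = in_bold  # whitespace keeps the style of the previous char
        else:
            bold = "bold" in fontname.lower()
        if bold and not in_bold:
            # A new bold run starts: store the section collected so far.
            if heading:
                sections[_clean(heading)] = _clean(body)
            heading, body = [], []
        (heading if bold else body).append(text)
        in_bold = bold
    if heading:
        sections[_clean(heading)] = _clean(body)
    return sections


# Usage, with a placeholder path:
# result = bold_sections(iter_chars("bulletin.pdf"))
```

Note that a bold word in the middle of a paragraph would also start a new key, so some post-filtering may be needed for real documents.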
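You could also use pdfplumber to achieve the same outcome. The snippet below drops the bold characters and prints the remaining text of the first page:

    import pdfplumber

    # file_to_parse is the path to your PDF file
    with pdfplumber.open(file_to_parse) as pdf:
        page = pdf.pages[0]
        # Keep every object except characters set in a bold font
        clean_page = page.filter(lambda obj: not (obj["object_type"] == "char" and "Bold" in obj["fontname"]))
        print(clean_page.extract_text())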
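The filter above removes the bold text. To look at the bold parts themselves, the same predicate can be inverted. A minimal sketch, assuming the font names contain "bold" in some casing; `is_bold_char` and `split_bold` are hypothetical helper names, not part of pdfplumber:

```python
def is_bold_char(obj):
    """True for character objects whose font name suggests a bold variant."""
    return (obj.get("object_type") == "char"
            and "bold" in obj.get("fontname", "").lower())


def split_bold(pdf_path, page_number=0):
    """Return (bold_text, other_text) for one page of the PDF."""
    import pdfplumber  # imported lazily; only needed when a PDF is read

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Page.filter keeps only the objects for which the predicate is true.
        bold_text = page.filter(is_bold_char).extract_text()
        other_text = page.filter(lambda obj: not is_bold_char(obj)).extract_text()
    return bold_text, other_text


# Usage, with a placeholder path:
# headings, body = split_bold("bulletin.pdf")
```

Comparing the two strings is a quick way to check whether the font-name heuristic really isolates the headings before building the dictionary.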
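I would convert the file to .docx using one of the methods below; the result should be much easier to parse, but I haven't done this in a long time. Note that the extraction code further down uses python-docx, which reads .docx files but not the legacy .doc format, so the conversion targets docx.
first option - using LibreOffice
    lowriter --invisible --convert-to docx '/your/file.pdf'
If LibreOffice opens the PDF in Draw and refuses to export it, adding --infilter="writer_pdf_import" to the command is the usual workaround.
second option - the same conversion for a whole folder, driven from Python (it still calls LibreOffice)
    import os
    import subprocess

    # Walk the folder and convert every PDF with LibreOffice
    for top, dirs, files in os.walk('/my/pdf/folder'):
        for filename in files:
            if filename.endswith('.pdf'):
                abspath = os.path.join(top, filename)
                # Argument list instead of shell=True, so odd file names are safe
                subprocess.call(['lowriter', '--invisible', '--convert-to', 'docx', abspath])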
then extract all bold and italic runs (python-docx reads .docx files):
    from docx import Document

    document = Document('path_to_your_files')
    bolds = []
    italics = []
    for para in document.paragraphs:
        for run in para.runs:
            # run.italic / run.bold are True only when set directly on the run
            if run.italic:
                italics.append(run.text)
            if run.bold:
                bolds.append(run.text)

    boltalic_Dict = {'bold_phrases': bolds,
                     'italic_phrases': italics}
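The lists above collect the bold and italic runs but not the text that follows each heading, which is what the question asks for. One possible way to get there, assuming the converted document keeps each heading as its own fully bold paragraph (this depends on how well the conversion went), is sketched below. `read_paragraphs` and `sections_from_paragraphs` are hypothetical helper names.

```python
def read_paragraphs(docx_path):
    """Return each paragraph as a list of (text, bold) run pairs."""
    from docx import Document  # python-docx, imported lazily

    document = Document(docx_path)
    return [[(run.text, run.bold) for run in para.runs]
            for para in document.paragraphs]


def sections_from_paragraphs(paragraphs):
    """Map each all-bold paragraph to the text of the paragraphs after it."""
    sections = {}
    heading = None
    for runs in paragraphs:
        text = "".join(t for t, _ in runs).strip()
        if not text:
            continue  # skip empty paragraphs
        # run.bold is True, False, or None (inherited from the style);
        # only an explicit True counts as bold here.
        is_heading = all(bold is True for t, bold in runs if t.strip())
        if is_heading:
            heading = text
            sections[heading] = ""
        elif heading is not None:
            # Body text; anything before the first heading is ignored.
            sections[heading] = (sections[heading] + " " + text).strip()
    return sections


# Usage, with a placeholder path:
# result = sections_from_paragraphs(read_paragraphs("bulletin.docx"))
```

If the headings get their boldness from a paragraph style rather than from the runs, `run.bold` will be None and the check would need to look at the style instead.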
Upvotes: 1