Walid Tahir

Reputation: 13

Extract / scrape data from a PDF with Python

I am trying to automate a task. I have a PDF file that I download from a website.

For example, the following PDF: https://www.dgssi.gov.ma/sites/default/files/vulnerabilites_affectant_plusieurs_produits_de_cisco_13.pdf

I want to collect the information from the file as a Python dictionary of the form {'bold sentence': 'the sentences after the bold sentence'}

Example: {....... , 'Solution': 'Veuillez se référer aux bulletins de sécurité de Cisco pour mettre à jours vos équipements', .....}

I already tried converting the PDF to HTML and scraping that, but I cannot tell the different parts apart because all the HTML tags are alike.

If you can suggest a solution or some code that extracts the content as a dictionary, I would be very grateful.

Any help would be appreciated, and if I need to be more specific let me know.

Upvotes: 0

Views: 2322

Answers (1)

Guy Nachshon

Reputation: 2645

EDIT - adding another approach

Basically, PDFs don't contain bold or italic text as such. Instead, bold text is drawn with a bold variant of the same font family. We can take advantage of this: look up the font name of each piece of text and check whether it contains "bold".

You could use extract_pages from pdfminer.six, iterate over every character, and check whether its font name contains "bold".
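A minimal sketch of that character-level check, assuming pdfminer.six is installed. The helper names is_bold_font and iter_char_styles are made up for illustration, and the "bold in the font name" test is only a heuristic: some PDFs embed bold fonts under names that don't say so.

```python
def is_bold_font(fontname):
    """Heuristic: bold variants usually carry 'Bold' in the font name."""
    return "bold" in (fontname or "").lower()


def iter_char_styles(pdf_path):
    """Yield (character, is_bold) pairs for every glyph in the PDF."""
    # Imported lazily so is_bold_font works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    # LTAnno objects (inserted spaces/newlines) carry no font.
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), is_bold_font(obj.fontname)
```

Consecutive characters with the same flag can then be joined into bold and non-bold segments.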

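You could also use pdfplumber for the same kind of font check. The snippet below filters the bold characters out of the first page and prints the remaining text: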

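import pdfplumber

with pdfplumber.open(file_to_parse) as pdf:  # file_to_parse: path to your PDF
    page = pdf.pages[0]
    # Keep every object except characters drawn in a bold font.
    non_bold = page.filter(lambda obj: not (obj["object_type"] == "char" and "Bold" in obj["fontname"]))
    print(non_bold.extract_text())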
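To get closer to the dictionary asked for in the question, one possible sketch is to let pdfplumber group characters into words together with their font name, then treat each run of bold words as a key and the non-bold words that follow as its value. The function names group_by_bold and pdf_to_dict are hypothetical, and the sketch assumes headings are the only bold text in the document.

```python
def group_by_bold(words):
    """Map each run of bold words to the non-bold words that follow it.

    `words` is a list of dicts with at least "text" and "fontname" keys.
    """
    result = {}
    heading_parts, body_parts = [], []
    for word in words:
        if "bold" in word["fontname"].lower():
            if body_parts:
                # A new heading starts: store the previous heading/body pair.
                result[" ".join(heading_parts)] = " ".join(body_parts)
                heading_parts, body_parts = [], []
            heading_parts.append(word["text"])
        elif heading_parts:
            # Non-bold text before the first heading is ignored.
            body_parts.append(word["text"])
    if heading_parts:
        result[" ".join(heading_parts)] = " ".join(body_parts)
    return result


def pdf_to_dict(pdf_path):
    """Read every page of the PDF and build the heading -> text dictionary."""
    import pdfplumber  # imported lazily; group_by_bold has no dependencies

    words = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extra_attrs adds the font name to each extracted word.
            words.extend(page.extract_words(extra_attrs=["fontname"]))
    return group_by_bold(words)
```

Calling pdf_to_dict on the downloaded file should give entries such as 'Solution' mapped to the sentence after it, provided the PDF really uses a font whose name contains "Bold" for its headings.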

I would convert the file to a Word document using the methods described at the end, since that would be much easier to parse, BUT I haven't done that in a long time.

Converting to DOCX

first option - using LibreOffice

lowriter --invisible --infilter="writer_pdf_import" --convert-to docx '/your/file.pdf'

second option - calling LibreOffice from Python

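import os
import subprocess

# Walk the folder and convert every PDF with LibreOffice (must be on PATH).
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.lower().endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # An argument list avoids shell quoting problems with odd file names.
            subprocess.run(['lowriter', '--invisible',
                            '--infilter=writer_pdf_import',
                            '--convert-to', 'docx',
                            '--outdir', top, abspath], check=True)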

Then extract all bold and italic runs with python-docx (it reads .docx files only, not .doc):

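from docx import Document

document = Document('path_to_your_files')
bolds = []
italics = []
for para in document.paragraphs:
    for run in para.runs:
        # run.bold and run.italic are None when the formatting is inherited
        # from the paragraph style, so only explicitly set runs are caught.
        if run.italic:
            italics.append(run.text)
        if run.bold:
            bolds.append(run.text)

boltalic_Dict = {'bold_phrases': bolds,
                 'italic_phrases': italics}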
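The lists above don't yet pair each bold phrase with the text after it. A rough sketch of that pairing follows, assuming the converted file is a .docx and that bold runs are the headings. The names runs_to_dict and docx_to_dict are invented for this example. Note that run.bold is None when boldness comes from a style, so such runs are treated as non-bold here.

```python
def runs_to_dict(runs):
    """Pair each bold stretch with the non-bold text that follows it.

    `runs` is an iterable of (text, is_bold) tuples in document order.
    """
    result = {}
    heading, body = "", ""
    for text, bold in runs:
        if bold:
            if body.strip():
                # Previous heading is complete: store it and start a new one.
                result[heading.strip()] = body.strip()
                heading = ""
            else:
                # Only whitespace since the last bold run: same heading.
                heading += body
            body = ""
            heading += text
        elif heading:
            body += text
    if heading.strip():
        result[heading.strip()] = body.strip()
    return result


def docx_to_dict(docx_path):
    """Collect (text, is_bold) for every run, then build the dictionary."""
    from docx import Document  # python-docx, imported lazily

    runs = []
    for para in Document(docx_path).paragraphs:
        for run in para.runs:
            runs.append((run.text, bool(run.bold)))
        runs.append((" ", False))  # keep paragraphs separated by a space
    return runs_to_dict(runs)
```

How well this works depends on how LibreOffice splits the converted text into runs, so the result is worth checking by hand on a sample file.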

Upvotes: 1
