PDFminer get font size from headers per each page (iteration)

Question

I am quite new to Python and PDFMiner, which is a bit complex for me. What I am trying to achieve is to extract the title of each page from a PDF file or set of slides.

My approach is to get a list of the text lines and their font sizes for each page, then pick the line with the largest font size, since a slide heading is usually written in a larger font.

This is what I did so far:

Suppose I want to get the page #8 title from this pdf file. File sample

This is what the content of page #8 looks like:

This is the code that gets the font size per line for all pages:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTLine, LAParams
import os

path = r'cov.pdf'

Extract_Data = []

for page_layout in extract_pages(path):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size = character.size
            Extract_Data.append([Font_size, (element.get_text())])

The generated list Extract_Data covers all pages of the PDF document. My question is: how can I get this list separately for each page (per iteration) of the document?

Expected output for page number 8 only (and likewise for every other page). To pick the page title, I would then take the item (line) with the highest font size:

[[32.039999999999964, 'Pandemic declaration 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0,
  '•  On March 11, 2020, the World Health Organization 
(WHO) characterized COVID-19 as a pandemic. 
 
•  It has caused severe illness and death. It features 
 
sustained person-to-person spread worldwide. 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, '•  It poses an especially high risk for the elderly (60 or 
 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0,
  'older), people with preexisting health conditions such 
as high blood pressure, heart disease, lung disease, 
  
diabetes, autoimmune disorders, and certain workers. 
 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [24.0, ' 
'],
 [14.04, '8 
']]

Pieter · Accepted Answer

Full disclosure, I'm one of the maintainers of pdfminer.six.

A pythonic way of doing this would be the following.

import os

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar


def get_font_sizes(paragraph: LTTextContainer):
    """Get the font sizes for every LTChar element in this LTTextContainer"""
    return [
        char.size
        for line in paragraph
        for char in line
        if isinstance(char, LTChar)
    ]


def list_sized_paragraphs(page):
    """List all the paragraphs and their maximum font size on this page"""
    return [
        (max(get_font_sizes(paragraph)), paragraph.get_text())
        for paragraph in page
        if isinstance(paragraph, LTTextContainer)
    ]


file_path = '~/Downloads/covid_19_training_tool_v3_01.05.2021_508.pdf'
for page in extract_pages(os.path.expanduser(file_path)):
    _, text = max(list_sized_paragraphs(page))
    print('---')
    print(text.strip())

For page 8 this prints:

Pandemic declaration
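
If the full per-page list from the question is needed rather than only the title, the same idea can be collected into a dictionary keyed by page number. The following is a minimal sketch, not part of the original answer: `group_by_page` and `iter_rows` are hypothetical helper names, the page number is simply the 1-based position in the `extract_pages` iteration, and `default=0.0` is an assumed fallback for text boxes that contain no `LTChar`.

```python
from collections import defaultdict


def group_by_page(rows):
    """Group (page_number, font_size, text) rows into
    {page_number: [[font_size, text], ...]}, keeping the original order."""
    pages = defaultdict(list)
    for page_number, font_size, text in rows:
        pages[page_number].append([font_size, text])
    return dict(pages)


def iter_rows(path):
    """Yield (page_number, max_font_size, text) for every text box in the PDF."""
    # Imported lazily so that group_by_page stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    for page_number, page in enumerate(extract_pages(path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            sizes = [
                char.size
                for line in element
                for char in line
                if isinstance(char, LTChar)
            ]
            # Assumption: report 0.0 for a box without any LTChar
            # instead of letting max() raise on an empty list.
            yield page_number, max(sizes, default=0.0), element.get_text()
```

With the question's file, `group_by_page(iter_rows('cov.pdf'))[8]` would then hold the `[font_size, text]` pairs for page 8 only.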

Note: this does not work for all pages, because sometimes a caution or note has a bigger font size than the header.
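
One possible way to soften this limitation is to also look at where a text box sits on the page. The sketch below is an untested heuristic, not something the original answer proposes: it assumes headers appear in the upper part of a slide, and the names `pick_title`, `page_candidates` and the `top_fraction` parameter are hypothetical. It relies on the `y1` attribute of layout elements and the `height` attribute of the page; PDF coordinates start at the bottom-left, so a larger `y1` means higher on the page.

```python
def pick_title(candidates, page_height, top_fraction=0.3):
    """Pick a likely title from (font_size, y_top, text) tuples.

    Boxes whose top edge lies in the upper `top_fraction` of the page are
    preferred; if there are none, every non-blank box is considered.
    The largest font wins, and ties go to the box that is higher up.
    """
    non_blank = [c for c in candidates if c[2].strip()]
    threshold = page_height * (1 - top_fraction)
    upper = [c for c in non_blank if c[1] >= threshold]
    pool = upper or non_blank
    if not pool:
        return None
    _, _, text = max(pool, key=lambda c: (c[0], c[1]))
    return text.strip()


def page_candidates(page):
    """Yield (max_font_size, y_top, text) for each text box on a pdfminer page."""
    # Imported lazily so that pick_title stays usable without pdfminer.six.
    from pdfminer.layout import LTChar, LTTextContainer

    for element in page:
        if not isinstance(element, LTTextContainer):
            continue
        sizes = [
            char.size
            for line in element
            for char in line
            if isinstance(char, LTChar)
        ]
        # Assumption: 0.0 stands in for boxes that contain no LTChar.
        yield max(sizes, default=0.0), element.y1, element.get_text()
```

Inside the page loop this could be called as `pick_title(list(page_candidates(page)), page.height)`. The value of `top_fraction` is a guess and would need tuning per document; a large note placed near the top of a slide would still be picked instead of the header.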
