V J
V J

Reputation: 171

How to find the Font Size of every paragraph of PDF file using python code?

Right now i am Working on a project in which i have to find the font size of every paragraph in that PDF file. i have tried various python libraries like fitz, PyPDF2, pdfrw, pdfminer, pdfreader. all the libraries fetch the text data but i don't know how to fetch the font size of the paragraphs. thanks in advance..your help is appreciated.

i have tried this but failed to get font size.

import fitz

filepath = '/home/user/Downloads/abc.pdf'
text = ''
with fitz.open(filepath ) as doc:
    for page in doc:
        text+= page.getText()
print(text)

Upvotes: 7

Views: 13791

Answers (2)

Vihaan Thora
Vihaan Thora

Reputation: 91

A better way to do this would be to use pymupdf/fitz itself. This library is significantly faster and cleaner in scraping the font information as compared to pdfminer. An example code snippet is shown below.

import fitz

def scrape(keyword, filePath):
    results = [] # list of tuples that store the information as (text, font size, font name) 
    pdf = fitz.open(filePath) # filePath is a string that contains the path to the pdf
    for page in pdf:
        dict = page.get_text("dict")
        blocks = dict["blocks"]
        for block in blocks:
            if "lines" in block.keys():
                spans = block['lines']
                for span in spans:
                    data = span['spans']
                    for lines in data:
                        if keyword in lines['text'].lower(): # only store font information of a specific keyword
                            results.append((lines['text'], lines['size'], lines['font']))
                            # lines['text'] -> string, lines['size'] -> font size, lines['font'] -> font name
    pdf.close()
    return results

If you wish to find the font information of every line, you may omit the if condition that checks for a specific keyword.

You can extract the text information in any desired format by understanding the structure of dictionary outputs that we obtain by using get_text("dict"), as mentioned in the documentation.

Upvotes: 4

V J
V J

Reputation: 171

I got the solution from pdfminer. The python code for the same is given below.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar,LTLine,LAParams
import os
path=r'/path/to/pdf'

Extract_Data=[]

for page_layout in extract_pages(path):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size=character.size
            Extract_Data.append([Font_size,(element.get_text())])

Upvotes: 8

Related Questions