Thanh Long Phan

Reputation: 43

Filter PDF text by font with pdfminer

I am using pdfminer.six to extract the text set in a specific font. My current function has the following problem:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def extract_text_by_font(pdf_file):
    extracted_text = ""

    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            extracted_text += character.get_text()

    return extracted_text

If I compare the output of this function with that of pdfminer.high_level.extract_text, extract_text_by_font does not extract the text properly. For example, with pdfminer.high_level.extract_text I get

"... Hello World..."

but with extract_text_by_font I get

"...HelloWorld...".

So it sometimes drops the whitespace between words. How can I fix it?

Upvotes: 0

Views: 72

Answers (1)

Noname NoSurname

Reputation: 404

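Try this. It inserts a space whenever the horizontal gap between two consecutive characters is larger than the character width: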

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def extract_text_by_font(pdf_file):
    extracted_text = ""

    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    if not isinstance(text_line, LTTextLine):
                        continue
                    # Reset per line so the gap check never compares
                    # characters from two different lines.
                    prev_x = None
                    for character in text_line:
                        if isinstance(character, LTChar):
                            # Add a space if the horizontal gap to the previous
                            # character is more than the character width.
                            if prev_x is not None and character.x0 - prev_x > character.width:
                                extracted_text += ' '

                            extracted_text += character.get_text()
                            prev_x = character.x1
                    # Separate lines so the last word of one line does not
                    # run into the first word of the next.
                    extracted_text += '\n'

    return extracted_text
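
An alternative that avoids guessing gaps from coordinates: as far as I know, pdfminer's layout analysis already places the spaces and line breaks it infers inside each text line as LTAnno objects, and the isinstance(character, LTChar) check skips them, which would explain why the words run together. The sketch below keeps those LTAnno items and filters LTChar objects by their fontname attribute. The helper name filter_line_by_font and the font_name parameter are my own choices for illustration, and the substring match assumes font names of the form "ABCDEF+Arial-BoldMT" (a subset prefix followed by the real name), so adjust it to whatever your PDF reports.

```python
def filter_line_by_font(line_items, font_name):
    """Join the items of one text line, keeping only characters whose
    font name contains font_name plus the whitespace between them."""
    pieces = []
    for item in line_items:
        fontname = getattr(item, "fontname", None)
        if fontname is None:
            # No fontname: treated as an LTAnno, i.e. a space or newline
            # added by layout analysis. Keep it only right after kept text
            # so that skipped characters do not leave runs of whitespace.
            if pieces and not pieces[-1].isspace():
                pieces.append(item.get_text())
        elif font_name in fontname:
            # LTChar in the wanted font (substring match, see note above).
            pieces.append(item.get_text())
    return "".join(pieces)


def extract_text_by_font(pdf_file, font_name):
    # Imported here so the helper above can be tried without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    parts = []
    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    if isinstance(text_line, LTTextLine):
                        parts.append(filter_line_by_font(text_line, font_name))
    return "".join(parts)
```

You would call it as extract_text_by_font("document.pdf", "Arial"), where the file name and font name are placeholders. Printing character.fontname for a few characters first is an easy way to see which names your document actually uses.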

Upvotes: 0
