Reputation: 43
I am using pdfminer.six to extract text in a specific font, but I have run into the following problem:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def extract_text_by_font(pdf_file):
    extracted_text = ""
    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            extracted_text += character.get_text()
    return extracted_text
If I compare the output of this function with the output of pdfminer.high_level.extract_text, extract_text_by_font does not extract the text properly. For example, with pdfminer.high_level.extract_text I get
"... Hello World..."
but with extract_text_by_font I get
"...HelloWorld...".
So it sometimes drops the whitespace. How can I fix this?
Upvotes: 0
Reputation: 404
Try this:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def extract_text_by_font(pdf_file):
    extracted_text = ""
    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    # Right edge of the previous character. Reset for every
                    # line so the first character never gets a leading space.
                    prev_x = None
                    for character in text_line:
                        if isinstance(character, LTChar):
                            # Add a space if the gap to the previous character
                            # is more than the character width.
                            if prev_x is not None and character.x0 - prev_x > character.width:
                                extracted_text += ' '
                            extracted_text += character.get_text()
                            prev_x = character.x0 + character.width
    return extracted_text
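Another option, as a rough sketch: pdfminer represents the spaces and line breaks it infers during layout analysis as LTAnno objects rather than LTChar, which is most likely why the loop in the question loses them. Instead of guessing gaps from coordinates, you can keep those LTAnno items. The helper name filter_line_by_font and the font_name parameter are my own assumptions for illustration (the question never shows how the font is chosen), and the substring match on fontname assumes names like "ABCDEF+Helvetica-Bold".

```python
def filter_line_by_font(items, font_name):
    """Join the text of one line, keeping only characters in a matching font.

    `items` are the children of an LTTextLine. LTChar objects carry a
    `fontname` attribute; LTAnno objects (inferred spaces/newlines) do not.
    """
    parts = []
    prev_kept = False
    for item in items:
        fontname = getattr(item, "fontname", None)
        if fontname is not None:
            # Real glyph: keep it only if its font matches.
            prev_kept = font_name in fontname
            if prev_kept:
                parts.append(item.get_text())
        elif prev_kept:
            # Inferred whitespace: keep it only right after kept text.
            parts.append(item.get_text())
    return "".join(parts)


def extract_text_by_font(pdf_file, font_name):
    # Imported here so filter_line_by_font can be tried without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    chunks = []
    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    if isinstance(text_line, LTTextLine):
                        chunks.append(filter_line_by_font(text_line, font_name))
    return "".join(chunks)
```

Called as extract_text_by_font("file.pdf", "Bold"), this should keep the word spacing that extract_text produces for the matching runs, since both rely on the same layout analysis.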
Upvotes: 0