Python: How to solve merged words when extracting text from pdf?

I'm struggling with word extraction from a set of PDF files. These files are academic papers that I downloaded from the web.

The data is stored on my local device, sorted by name, following this relative path inside the project folder: './papers/data'. You can find my data here.

My code runs inside a code folder in the project repo ('./code').

The PDF word extraction section of the code looks like this:

import PyPDF2 as pdf
from os import listdir 

# Open the files:
# I) List the files:
files_in_dir = listdir('../papers/data')
# II) Open each file and save its text to a Python object:
papers_text_list = []
for idx in range(len(files_in_dir)):
    with open(f"../papers/data/{files_in_dir[idx]}", mode="rb") as paper:
        my_pdf = pdf.PdfFileReader(paper)
        # Dynamically create one variable text_0, text_1, ... per paper
        vars()["text_%s" % idx] = ''
        for i in range(my_pdf.numPages):
            page_to_print = my_pdf.getPage(i)
            vars()["text_%s" % idx] += page_to_print.extractText()
        papers_text_list.append(vars()["text_%s" % idx])

The problem is that for some texts I'm getting merged words inside the Python list.

text_1.split()

[ ... ,'examinedthee', 'ectsofdi', 'erentoutdoorenvironmentsinkindergartenchildren', '™sPAlevel,', 'ages3', 'Œ5.The', 'ndingsrevealedthatchildren', '‚sPAlevelhigherin', 'naturalgreenenvironmentsthaninthekindergarten', '™soutdoorenvir-', 'onment,whichindicatesgreenenvironmentso', 'erbetteropportunities', 'forchildrentodoPA.', ...]

Other texts, however, are extracted correctly.

text_0.split()

['Urban','Forestry', '&', 'Urban', 'Greening', '16', '(2016)','76–83Contents', 'lists', 'available', 'at', 'ScienceDirect', 'Urban', 'Forestry', '&', 'Urban', 'Greening', ...]

At this point, I thought that tokenizing could solve my problem, so I gave the nltk module a chance.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
doc = tokenizer.tokenize(text_1)
paper_words = [token for token in doc]
paper_words_lower = []
for token  in paper_words:
    try:
        word = token.lower()
    except TypeError:
        word = token 
    finally:
        paper_words_lower.append(word)

['contentslistsavailableat', 'sciencedirecturbanforestry', 'urbangreening', 'journalhomepage', 'www', 'elsevier', 'com', 'locate', 'ufug', 'urbangreenspacesforchildren', 'across', 'sectionalstudyofassociationswith', 'distance', 'physicalactivity', 'screentime', 'generalhealth', 'andoverweight', 'abdullahakpinar', 'adnanmenderesüniversitesi', 'ziraatfakültesi', 'peyzajmimarl', 'bölümü', '09100ayd', 'õn', 'turkey', ... 'sgeneralhealth', 'onlychildren', 'sagewas', 'signicantlyassociatedwiththeiroverweight', ...]

I even tried the spacy module... but the problem was still there.

My conclusion is that if the problem can be solved, it has to be in the PDF word extraction step. I found this related StackOverflow question, but its solution couldn't fix my problem.

Why is this happening, and how can I solve it?

PS: A paper on the list that serves as an example of the trouble is "AKPINAR_2017_Urban green spaces for children.pdf".

You can use the following code to import it:

import PyPDF2 as pdf
with open("AKPINAR_2017_Urban green spaces for children.pdf", mode="rb") as paper:
    my_pdf = pdf.PdfFileReader(paper)
    text = ''
    for i in range(my_pdf.numPages):
        page_to_print = my_pdf.getPage(i)
        text += page_to_print.extractText()

Upvotes: 0

Views: 1710

Answers (3)

K J

Reputation: 11877

By far the simplest method is to use a modern PDF viewer/editor that allows cut and paste, with some additional adjustments. I had no problems reading aloud or extracting most of those academic journals, since they are (bar one) readable text and thus export well as plain text. It took 4 seconds TOTAL to export 24 of those PDF files (6 per second, except #24of25) into readable text, using forfiles /m *.pdf /C "cmd /c pdftotext -simple2 @file @fname.txt". Compare the result with your first non-readable example.
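If you would rather drive the same pdftotext export from Python instead of the Windows forfiles one-liner, a minimal sketch could look like this (it assumes the pdftotext binary is on your PATH; -simple2 is an Xpdf layout switch, so drop it if your build doesn't have it):

import subprocess
from pathlib import Path

# Convert every PDF in the data folder to a .txt file next to it,
# mirroring the forfiles/pdftotext command line above.
for pdf_path in Path('../papers/data').glob('*.pdf'):
    subprocess.run(['pdftotext', '-simple2', str(pdf_path), str(pdf_path.with_suffix('.txt'))],
                   check=True)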

However, the one exception was Hernadez_2005, because it consists of page images, so extraction needs OCR conversion with considerable (not trivial) training of the editor to handle scientific terms, foreign hyphenation, and constantly shifting styles. But with some work in, say, WordPad it can produce a good enough result, fit for editing in Microsoft Word, which you could save as plain text for parsing in Python.


Upvotes: 1

Duloren

Reputation: 2711

PyPDF2 has been unmaintained since 2018.

The problem is that there are a lot of pages around the web recommending PyPDF2, but actually nobody uses it nowadays.

I recently did the same, until I realized PyPDF2 is dead. I ended up using https://github.com/jsvine/pdfplumber. It is actively maintained, easy to use, and performs very well.
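For reference, a minimal sketch of the question's extraction loop rewritten with pdfplumber (the paths are assumed to match the question's folder layout) might look like this:

import pdfplumber
from os import listdir

papers_text_list = []
for name in listdir('../papers/data'):
    with pdfplumber.open(f'../papers/data/{name}') as my_pdf:
        # extract_text() rebuilds words and lines from character positions,
        # which in my experience copes better with tight spacing than PyPDF2
        text = ''.join((page.extract_text() or '') for page in my_pdf.pages)
    papers_text_list.append(text)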

Upvotes: 0

EliasK93

Reputation: 3174

Yes, this is a problem with the extraction. The spaces in the two example documents you mention are encoded differently.


PDFs usually do not have a clear concept of lines and words. They have characters/text boxes placed at certain positions in the document. The extraction can't read it "char by char" like e.g. a txt file; it parses from top left to bottom right and uses the distances between characters to make assumptions about what is a line, what is a word, etc. Since the problematic document seems to use not only the space character but also character margins to the left and right to create nicer spacing for the text, the parser has difficulty understanding it.
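To see what the parser is actually working with, here is a small sketch (using pdfplumber purely as an example; any library that exposes character coordinates would do) that dumps the glyph positions the extractor has to turn back into words:

import pdfplumber

with pdfplumber.open('AKPINAR_2017_Urban green spaces for children.pdf') as my_pdf:
    # Each entry in .chars is one glyph with its bounding box; the horizontal
    # gap between one glyph's x1 and the next glyph's x0 is what extractors
    # use to decide whether a word ends there.
    for c in my_pdf.pages[0].chars[:15]:
        print(c['text'], round(c['x0'], 1), round(c['x1'], 1))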

Every parser does this slightly differently, so it might make sense to try out some different parsers; perhaps another one was designed for documents with similar patterns and is able to parse yours correctly. Also, since the PDF in the example does contain all the valid spaces but then confuses the parser by pulling characters closer together with negative margins, normal copy and paste into a txt file won't have that issue, since copy and paste ignores the margins.
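As one concrete option (my suggestion; I haven't checked it against this exact file), pdfminer.six does its own layout analysis and lets you tune the spacing thresholds it uses to decide where words break:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# word_margin and char_margin control when a gap between glyphs counts as a
# space; the values below are the library defaults, adjust them if words
# still come out merged.
text = extract_text('AKPINAR_2017_Urban green spaces for children.pdf',
                    laparams=LAParams(word_margin=0.1, char_margin=2.0))
print(text.split()[:20])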

If we are talking about a huge amount of data and you are willing to put some more time into this, you can check out some work on Optical Character Recognition Post Correction (OCR Post Correction): models that try to fix text parsed with errors (although that work usually focuses more on characters not being identified correctly due to unusual fonts etc. than on spacing issues).
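A lighter-weight post-correction for this particular merged-word symptom (my own suggestion, not something from the OCR literature above) is a dictionary-based word segmenter such as the wordsegment package; note it cannot restore characters the extractor dropped entirely, like the missing ff/fi ligatures in your example output:

from wordsegment import load, segment

load()  # loads the word-frequency data once
# Split every extracted token into dictionary words, e.g. a token like
# 'forchildrentodoPA' should come back as several separate words.
fixed_tokens = [word for token in text_1.split() for word in segment(token)]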

Upvotes: 0
