Sweety

Reputation: 85

How to extract a keyword and its page number from a PDF file using NLP?

[Images: two screenshots of the PDF file]

From the PDF file above, my code has to extract keywords, table names such as Table 1 and Table 2, and bold titles such as INTRODUCTION and CASE PRESENTATION from every page.

I wrote a small program to extract the text from the PDF file:

punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','^','&']

stop_words = stopwords.words('english')

keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

print(keywords)

The output I got is shown below:

[Image: screenshot of the printed keyword list]

From the above output, how can I extract keywords like INTRODUCTION, CASE PRESENTATION, and Table 1 along with their page numbers and save them in an output file?

Output Format

INTRODUCTION in Page 1

CASE PRESENTATION in Page 3

Table 1 (Descriptive Statistics) in Page 5

I need help obtaining output in this format.

Code

import PyPDF2
import textract
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def main():
    pdf_path = "Test1.pdf"

    # PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x names
    with open(pdf_path, "rb") as file_name:
        readpdf = PyPDF2.PdfFileReader(file_name)

        # Parse through each page to extract the text
        pdfPages = readpdf.numPages
        count = 0
        text = ""
        # The while loop reads each page.
        while count < pdfPages:
            pageObj = readpdf.getPage(count)
            count += 1
            text += pageObj.extractText()

    # PyPDF2 cannot read scanned files, so an empty result means the PDF is
    # image based. In that case, run the OCR library textract to convert it.
    if text == "":
        # The original passed an undefined name (fileurl); the PDF path is used here.
        # textract.process returns bytes, so decode before tokenizing.
        text = textract.process(pdf_path, method='tesseract', language='eng').decode('utf-8')

    # Break the text into individual words
    tokens = word_tokenize(text)

    # Remove punctuation and stop words
    punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','^','&']
    stop_words = stopwords.words('english')
    keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

    print(keywords)

Upvotes: 0

Views: 1600

Answers (3)

Jabzer

Reputation: 11

Issue partially resolved here: https://github.com/konfuzio-ai/document-ai-python-sdk/issues/6#issue-876036328

Check: https://github.com/konfuzio-ai/document-ai-python-sdk

# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init

from konfuzio_sdk.api import get_document_annotations
    
document_first_annotation = get_document_annotations(document_id=1111)[0]
page_index = document_first_annotation['bbox']['page_index']
keyword = document_first_annotation['offset_string']

The Annotation object in the Konfuzio SDK gives direct access to the keyword string but, at the moment, not to the page index. This attribute will be added soon. An example that accesses the first annotation in the first training document of your project:

# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init

from konfuzio_sdk.data import Project

my_project = Project()

annotations_first_doc = my_project.documents[0].annotations()
first_annotation = annotations_first_doc[0]
keyword = first_annotation.offset_string

Upvotes: 1

Sweety

Reputation: 85

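import PyPDF2
import fitz  # PyMuPDF
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Defined once at module level; the original used these names without defining them
punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','^','&']
stop_words = stopwords.words('english')

def main():
    with open("File1.pdf", "rb") as file_name:
        readPDF = PyPDF2.PdfFileReader(file_name)
        call_function(file_name, readPDF)

def call_function(fname, readpdf):
    pdfPages = readpdf.numPages
    # Open the document once instead of on every loop iteration
    doc_name = fitz.open(fname.name)

    for pageno in range(pdfPages):
        # Tokenize one page at a time so the page number stays known
        page = word_tokenize(doc_name[pageno].get_text())
        page_texts = [word for word in page if not word in stop_words and not word in punctuations]
        # pageno is zero-based, so add 1 for a human-readable page number
        print('Page Number:', pageno + 1)
        print('Page Texts :', page_texts)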
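
The question also asks for titles set in bold, which tokenizing plain text cannot detect. Below is a minimal sketch, assuming the PDF marks bold through the font flag that PyMuPDF exposes: `page.get_text("dict")` returns spans with a `flags` integer, and bit 4 (value 16) indicates bold. Some files only signal bold through the font name, which this sketch would miss. The helper names `bold_lines` and `bold_lines_by_page` are hypothetical and not part of PyMuPDF.

```python
BOLD_FLAG = 16  # bit 4 of a span's "flags" marks bold text in PyMuPDF


def bold_lines(page_dict):
    """Collect the text of lines whose non-empty spans are all bold."""
    found = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so fall back to an empty list
        for line in block.get("lines", []):
            spans = [s for s in line["spans"] if s["text"].strip()]
            if spans and all(s["flags"] & BOLD_FLAG for s in spans):
                found.append("".join(s["text"] for s in spans).strip())
    return found


def bold_lines_by_page(pdf_path):
    """Yield (page_number, text) for every bold line; pages start at 1."""
    import fitz  # PyMuPDF, imported here so bold_lines works without it

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for text in bold_lines(page.get_text("dict")):
                yield page_number, text
```

Calling `list(bold_lines_by_page("File1.pdf"))` would then give pairs of page number and bold line text, which can be filtered further (for example, keeping only uppercase lines).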
        

Upvotes: 0

furas

Reputation: 142681

If you want to know which page some text is on, you shouldn't add everything to one string; you should work with every page separately (in a for-loop).

It could be something similar to this. This code does not use tesseract, which would need a way to split the PDF into separate pages and then work with every page separately.

pdfPages = readpdf.numPages

# create it before loop
punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','^','&']
stop_words = stopwords.words('english')

#all_pages = []

# work with every page separately
for count in range(pdfPages):

    pageObj = readpdf.getPage(count)

    page_text = pageObj.extractText()
    
    page_tokens = word_tokenize(page_text)

    page_keywords = [word for word in page_tokens if not word in stop_words and not word in punctuations]

    page_uppercase_words = [word for word in page_keywords if word.isupper()]

    #all_pages.append( (count, page_keywords, page_uppercase_words) )

    print('page:', count)
    print('keywords:', page_keywords) 
    print('uppercase:', page_uppercase_words)

    # TODO: append/save page to file 
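
One way to fill in the TODO and produce the requested "INTRODUCTION in Page 1" format is sketched below. It is only a sketch under stated assumptions: it works on a list of per-page text strings (such as each `page_text` from the loop above), and it assumes that headings appear on their own line in capital letters and that table captions start with "Table" followed by a number. PDF text extraction does not guarantee that line layout. The two patterns and the function names `find_entries`, `format_entries` and `save_entries` are hypothetical.

```python
import re

# Assumed patterns, not taken from the original posts:
# a heading is a line made only of capital letters and spaces,
# a table caption is a line starting with "Table <number>".
HEADING_RE = re.compile(r"^[A-Z][A-Z ]{2,}$")
TABLE_RE = re.compile(r"^Table\s+\d+\b.*$")


def find_entries(page_texts):
    """Return (label, page_number) pairs; page numbers start at 1."""
    entries = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            line = line.strip()
            if HEADING_RE.match(line) or TABLE_RE.match(line):
                entries.append((line, page_number))
    return entries


def format_entries(entries):
    """Format pairs as 'LABEL in Page N' lines."""
    return [f"{label} in Page {page}" for label, page in entries]


def save_entries(entries, path):
    """Write one formatted entry per line to a text file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(format_entries(entries)) + "\n")
```

Inside the loop, each `page_text` could be appended to a list, and after the loop `save_entries(find_entries(all_page_texts), "output.txt")` would write the result; `all_page_texts` and `output.txt` are placeholder names.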

Upvotes: 1
