Reputation: 85
In the above PDF file, my code has to extract keywords and table names like Table 1 and Table 2, and titles in bold letters like INTRODUCTION and CASE PRESENTATION, from all pages of the given PDF.
I wrote a small program to extract text from the PDF file:
punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','&']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
print(keywords)
and the output I got was as below
From the above output, how do I extract keywords like INTRODUCTION, CASE PRESENTATION, and Table 1, along with the page number, and save them in an output file?
Output Format
INTRODUCTION in Page 1
CASE PRESENTATION in Page 3
Table 1 (Descriptive Statistics) in Page 5
I need help obtaining output in this format.
Code
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def main():
    file_name = open("Test1.pdf", "rb")
    readpdf = PyPDF2.PdfFileReader(file_name)

    # Parse through each page to extract the text
    pdfPages = readpdf.numPages
    count = 0
    text = ""
    print()

    # The while loop reads each page.
    while count < pdfPages:
        pageObj = readpdf.getPage(count)
        count += 1
        text += pageObj.extractText()

    # Check whether the library returned any words. This is needed
    # because PyPDF2 cannot read scanned files.
    if text == "":
        # Run the OCR library textract to convert scanned/image-based
        # PDF files into text.
        text = textract.process("Test1.pdf", method='tesseract', language='eng').decode('utf-8')

    # PRINT THE TEXT EXTRACTED FROM GIVEN PDF
    #print(text)

    # Break the text into individual words
    tokens = word_tokenize(text)
    #print('TOKENS')
    #print(tokens)

    # Clean out the punctuation that is not required.
    punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','&']
    stop_words = stopwords.words('english')
    keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
    print(keywords)
Upvotes: 0
Views: 1600
Reputation: 11
Issue partially resolved here: https://github.com/konfuzio-ai/document-ai-python-sdk/issues/6#issue-876036328
Check: https://github.com/konfuzio-ai/document-ai-python-sdk
# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init
from konfuzio_sdk.api import get_document_annotations
document_first_annotation = get_document_annotations(document_id=1111)[0]
page_index = document_first_annotation['bbox']['page_index']
keyword = document_first_annotation['offset_string']
The Annotation object in the Konfuzio SDK gives direct access to the keyword string but, at the moment, not to the page index. This attribute will be added soon. An example of accessing the first annotation in the first training document of your project would be:
# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init
from konfuzio_sdk.data import Project
my_project = Project()
annotations_first_doc = my_project.documents[0].annotations()
first_annotation = annotations_first_doc[0]
keyword = first_annotation.offset_string
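Putting the two pieces together, the question's output file could be written with a loop over all annotations. This is a sketch only, assuming each annotation dict has the bbox page_index (zero-based) and offset_string fields shown in the first snippet, with 1111 standing in for your document id:
from konfuzio_sdk.api import get_document_annotations

annotations = get_document_annotations(document_id=1111)  # hypothetical document id

with open("keywords.txt", "w") as out:
    for annotation in annotations:
        page = annotation['bbox']['page_index'] + 1  # assuming page_index is zero-based
        keyword = annotation['offset_string']
        out.write(keyword + ' in Page ' + str(page) + '\n')  # e.g. "INTRODUCTION in Page 1"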
Upvotes: 1
Reputation: 85
import PyPDF2
import fitz  # PyMuPDF
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','&']
stop_words = stopwords.words('english')

def main():
    file_name = open("File1.pdf", "rb")
    readPDF = PyPDF2.PdfFileReader(file_name)
    call_function(file_name, readPDF)

def call_function(fname, readpdf):
    pdfPages = readpdf.numPages
    # Open the document once with PyMuPDF and tokenize each page separately
    doc_name = fitz.open(fname.name)
    for pageno in range(pdfPages):
        page = word_tokenize(doc_name[pageno].get_text())
        page_texts = [word for word in page if word not in stop_words and word not in punctuations]
        print('Page Number:', pageno)
        print('Page Texts :', page_texts)

main()
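To turn the per-page tokens into the output file requested in the question, the simplest extension is to keep only the uppercase tokens and write them with their 1-based page number. A minimal sketch building on the code above; the isupper() filter is an assumption about how the bold headings appear in the extracted text:
import fitz  # PyMuPDF

def extract_headings(pdf_path, output_path):
    # Write uppercase headings such as INTRODUCTION with their page number.
    doc = fitz.open(pdf_path)
    with open(output_path, "w") as out:
        for pageno, page in enumerate(doc, start=1):
            for word in page.get_text().split():
                # Assumption: headings are fully uppercase words of 3+ letters.
                if word.isupper() and len(word) >= 3:
                    out.write(word + ' in Page ' + str(pageno) + '\n')

extract_headings("File1.pdf", "headings.txt")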
Upvotes: 0
Reputation: 142681
If you want information about the page on which some text appears, then you shouldn't add everything to one string; you should work with every page separately (in a for-loop).
It could be something similar to this. It is code without tesseract, which would need a method to split the PDF into separate pages and work with every page separately.
pdfPages = readpdf.numPages

# create these before the loop
punctuations = ['(',')',';',':','[',']',',','^','=','-','!','.','{','}','/','#','&']
stop_words = stopwords.words('english')

#all_pages = []

# work with every page separately
for count in range(pdfPages):
    pageObj = readpdf.getPage(count)
    page_text = pageObj.extractText()

    page_tokens = word_tokenize(page_text)
    page_keywords = [word for word in page_tokens if word not in stop_words and word not in punctuations]
    page_uppercase_words = [word for word in page_keywords if word.isupper()]

    #all_pages.append( (count, page_keywords, page_uppercase_words) )

    print('page:', count)
    print('keywords:', page_keywords)
    print('uppercase:', page_uppercase_words)

    # TODO: append/save page to file
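To fill in the TODO and also catch multi-word names like Table 1 (which per-token filtering splits into Table and 1), one option is to search the raw page text with regular expressions and write matches in the question's format. This is only a sketch: the two patterns are assumptions about how headings and table captions look in this particular PDF, and it reuses the readpdf object from above.
import re

# Assumed patterns: all-caps headings and "Table <n>" captions.
heading_pattern = re.compile(r'\b[A-Z][A-Z ]{2,}[A-Z]\b')  # e.g. INTRODUCTION, CASE PRESENTATION
table_pattern = re.compile(r'\bTable \d+\b')               # e.g. Table 1

with open('keywords.txt', 'w') as out:
    for count in range(pdfPages):
        page_text = readpdf.getPage(count).extractText()
        matches = heading_pattern.findall(page_text) + table_pattern.findall(page_text)
        for match in matches:
            # Pages are reported 1-based, as in the requested output
            out.write(match + ' in Page ' + str(count + 1) + '\n')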
Upvotes: 1