Reputation: 2554
I am making a Java jar file call from Python.
def extract_words(file_path):
    """
    Extract words and bounding boxes
    Arguments:
        file_path {[str]} -- [Input file path]
    Returns:
        [Document]
    """
    extractor = PDFBoxExtractor(
        file_path=file_path,
        jar_path="external/pdfbox-app-2.0.15.jar",
        class_path="external",
    )
    document = extractor.run()
    return document
And somewhere:
pipe = subprocess.Popen(
    ['java',
     '-cp',
     '.:%s:%s' % (self._jar_path, self._class_path),
     'PrintTextLocations',
     self._file_path],
    stdout=subprocess.PIPE)
output = pipe.communicate()[0].decode()
This works fine. The problem is that the jar is heavy, and when I call this repeatedly in a loop, it takes 3-4 seconds to load the jar file each time. Over 100 iterations, that adds 300-400 seconds to the process.
Is there any way to keep the classpath alive for Java and avoid loading the jar file on every call? What's the best way to do this in a time-optimised manner?
Upvotes: 2
Views: 54
Reputation: 340
You can encapsulate your PDFBoxExtractor in a class by making it a class member. Initialize the PDFBoxExtractor in the constructor of the class, like below:
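class WordExtractor:
    def __init__(self):
        # Build the extractor once; the input file is supplied per call.
        # Assumption: PDFBoxExtractor accepts file_path=None at construction.
        self.extractor = PDFBoxExtractor(
            file_path=None,
            jar_path="external/pdfbox-app-2.0.15.jar",
            class_path="external",
        )

    def extract_words(self, file_path):
        """
        Extract words and bounding boxes
        Arguments:
            file_path {[str]} -- [Input file path]
        Returns:
            [Document]
        """
        # Assumption: the extractor keeps its input path in _file_path,
        # as the Popen call in the question suggests.
        self.extractor._file_path = file_path
        document = self.extractor.run()
        return document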
The next step would be to create an instance of the WordExtractor class outside the loop.
word_extractor = WordExtractor()
# your loop would go here
while True:
    document = word_extractor.extract_words(file_path)
This is just example code to explain the concept. You may tweak it to fit your requirements.
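One caveat: reusing the Python object only saves time if run() does not start a new java process on each call. If it does (as the Popen call in the question suggests), every call still pays the JVM start-up cost. Here is a rough sketch of one way around that. It assumes you can write a small Java wrapper that stays running, reads one file path per line from stdin, prints the result, and then prints an end marker. The PersistentWorker class, the "PrintTextLocationsServer" class name, and the "<<END>>" marker are all made up for illustration; none of them exist in PDFBox.

```python
import subprocess


class PersistentWorker:
    """Keep one child process alive and exchange one request per line."""

    def __init__(self, command, end_marker="<<END>>"):
        self._end_marker = end_marker
        # Start the child once; stdin/stdout stay open between requests.
        self._proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def request(self, line):
        """Send one line and return the reply printed before the end marker."""
        self._proc.stdin.write(line + "\n")
        self._proc.stdin.flush()
        reply = []
        while True:
            out_line = self._proc.stdout.readline()
            if out_line == "":
                # Empty string means EOF: the child exited mid-reply.
                raise RuntimeError("worker exited before finishing its reply")
            out_line = out_line.rstrip("\n")
            if out_line == self._end_marker:
                break
            reply.append(out_line)
        return "\n".join(reply)

    def close(self):
        """Close stdin so the child sees EOF, then wait for it to exit."""
        self._proc.stdin.close()
        self._proc.wait()


# Hypothetical usage: "PrintTextLocationsServer" is an assumed Java class
# that loops over stdin and prints "<<END>>" after each file's output.
# worker = PersistentWorker(
#     ["java", "-cp", ".:external/pdfbox-app-2.0.15.jar:external",
#      "PrintTextLocationsServer"])
# for path in pdf_paths:
#     output = worker.request(path)
# worker.close()
```

The end marker is what tells Python a reply is complete while the process keeps running, so pick one that cannot appear in the extractor's own output. The Java side must also flush stdout after each reply, otherwise request() will block.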
Hope this helps!
Upvotes: 2