Reputation: 31
I am trying to extract text with PDFMiner by supplying co-ordinates. I have searched the internet but could not find any documentation or code for that. So far I have found code that extracts text and outputs its co-ordinates:
LTTextBoxHorizontal
(317.564, 91.32756, 580.93228, 116.24235999999999)
SHOULD ANY OF THE ABOVE DESCRIBED POLICIES BE CANCELLED BEFORE
THE EXPIRATION DATE THEREOF, NOTICE WILL BE DELIVERED IN
ACCORDANCE WITH THE POLICY PROVISIONS.
This is one of the co-ordinate and text pairs that I obtained. I also tried pdfquery, but I got a lot of errors:
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 268, in __call__
result = self._copy(*args, parent=self, **kwargs)
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 253, in _copy
return self.__class__(*args, **kwargs)
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 239, in __init__
xpath = self._css_to_xpath(selector)
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 249, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "build\bdist.win32\egg\cssselect\xpath.py", line 192, in css_to_xpath
File "build\bdist.win32\egg\cssselect\parser.py", line 355, in parse
File "build\bdist.win32\egg\cssselect\parser.py", line 370, in parse_selector_group
File "build\bdist.win32\egg\cssselect\parser.py", line 378, in parse_selector
File "build\bdist.win32\egg\cssselect\parser.py", line 437, in parse_simple_selector
File "build\bdist.win32\egg\cssselect\parser.py", line 535, in parse_attrib
cssselect.parser.SelectorSyntaxError: Expected string or ident, got <NUMBER '1' at 14>
Can someone help me with that?
Upvotes: 0
Views: 816
Reputation: 4021
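That happens when the pageid value in the selector is not quoted: the CSS parser expects a string or identifier after the = sign and gets the bare number 1 instead. Put the value in quotes, escaping them if the selector sits inside a single-quoted Python string.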
Try:
LTPage[pageid=\'1\']
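For the original goal of pulling text out by co-ordinates, here is a rough sketch of how the quoted selector could be combined with pdfquery's :in_bbox pseudo-class. The helper names (page_selector, in_bbox_selector, text_in_bbox) and the choice of LTTextLineHorizontal as the tag are my own assumptions for illustration, not part of pdfquery, and I have not run this against your file.

```python
def page_selector(page_id):
    """Build a page selector with the attribute value quoted.

    Hypothetical helper: using double quotes for the Python string means
    the inner single quotes need no backslash escaping.
    """
    return "LTPage[pageid='{}']".format(page_id)


def in_bbox_selector(x0, y0, x1, y1, tag="LTTextLineHorizontal"):
    """Build a selector for elements lying fully inside a bounding box.

    Co-ordinates are PDF points in the order x0, y0, x1, y1, the same
    order PDFMiner prints them in. The default tag is an assumption.
    """
    return '{}:in_bbox("{},{},{},{}")'.format(tag, x0, y0, x1, y1)


def text_in_bbox(pdf_path, page_id, bbox):
    """Return the text found inside bbox on the given page.

    Requires the third-party pdfquery package, imported lazily so the
    selector helpers above can be used without it.
    """
    import pdfquery

    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()  # parses the whole document
    # A space between the two parts means "descendant of that page".
    selector = page_selector(page_id) + " " + in_bbox_selector(*bbox)
    return pdf.pq(selector).text()


if __name__ == "__main__":
    # Both spellings produce the same selector string.
    print(page_selector(1) == 'LTPage[pageid=\'1\']')
    print(in_bbox_selector(317.564, 91.32756, 580.93228, 116.25))
```

Since :in_bbox only matches elements that sit completely inside the box, it may help to pad the co-ordinates by a point or so rather than reuse the exact values PDFMiner printed.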
Upvotes: 4