Reputation: 2189
I am trying to use the Python module textract to extract text from images. Because the images contain a lot of noise, the output I get contains noise in addition to the text I am actually interested in. Can someone suggest the best way to clean up the output?
Here is my code:
>>> import glob
>>> import textract
>>> for i in glob.glob("*.jpg"):
...     print(textract.process(i))
Here is my output:
...
-s.
4‘-0-.r-v .-
5,14,45_18685-M
c.
.4
"V-0-an .-
5,14,44_17793-M
5,13,66
17951-N
5,13,65_17959-N
Basically, what I want is the lines that start with the number "5" and nothing else. So I added a check to my code above, but it still didn't work the way I expected.
Here is the revised code:
>>> for i in glob.glob("*.jpg"):
...     text = textract.process(i)
...     if text.startswith('5'):
...         print text
And here is the output from the revised code:
5,13,66
17951-N
5,13,65_17959-N
Upvotes: 0
Views: 794
Reputation: 1501
So, taking into account your latest output, I think you should do this:
for i in glob.glob("*.jpg"):
    text = textract.process(i).strip()
    if text.startswith('5'):
        print text
That will remove all leading and trailing whitespace from the output. It looks like there's a lot of trailing whitespace in your case which is causing extra lines to appear between each line.
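To see what strip() does and does not change, here is a small sketch using a made-up sample string (an assumption standing in for the real textract output, which I can't reproduce here). It only trims whitespace at the two ends of the whole string, so any newlines between the lines stay in place, and startswith() still tests only the very first character of the whole block.

```python
# Hypothetical sample standing in for what textract.process() returned
sample = "\n\n5,13,66\n\n17951-N\n\n5,13,65_17959-N\n\n"

stripped = sample.strip()

# Only leading and trailing whitespace is removed; inner newlines remain
print(repr(stripped))

# startswith() looks at the start of the whole string, not at each line
print(stripped.startswith('5'))
```

If individual lines still need to be filtered, the text would have to be split into lines as well.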
Upvotes: 0
Reputation: 1208
Maybe you should try splitting the extracted text into lines first:
>>> for i in glob.glob("*.jpg"):
...     text = textract.process(i)
...     # Split the text into separate lines
...     for line in text.split('\n'):
...         if line.startswith('5'):
...             print line
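The same idea could look roughly like this on Python 3, where, as far as I know, textract.process() returns bytes rather than str, so the text has to be decoded before comparing against '5'. This is only a sketch: the helper name lines_starting_with and the sample bytes are made up for illustration, and the UTF-8 encoding is an assumption.

```python
def lines_starting_with(raw, prefix="5", encoding="utf-8"):
    """Return the lines of raw that start with prefix (hypothetical helper)."""
    # Assumption: the extracted text arrives as bytes, so decode it first
    if isinstance(raw, bytes):
        raw = raw.decode(encoding, errors="replace")
    # splitlines() handles both \n and \r\n; strip() removes stray spaces per line
    stripped = (line.strip() for line in raw.splitlines())
    return [line for line in stripped if line.startswith(prefix)]


# Sample bytes modelled on the output shown in the question
sample = (b"-s.\n5,14,45_18685-M\nc.\n.4\n5,14,44_17793-M\n\n"
          b"5,13,66\n17951-N\n5,13,65_17959-N\n")

for line in lines_starting_with(sample):
    print(line)
```

With real images you would pass textract.process(i) for each file from glob.glob("*.jpg") instead of the sample. Stripping each line before the check means a line with leading spaces in front of the "5" is still kept.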
Upvotes: 1