How to clean up this Python output

Question

I am using the Python module textract to extract text from images. Because the images contain a lot of noise, the output contains noise in addition to the text I am actually interested in. Can someone suggest the best way to clean up the output?

Here is my code:

>>> import glob
>>> import textract
>>> for i in glob.glob("*.jpg"):
...     print(textract.process(i))

Here is my output:

...






-s.

4‘-0-.r-v .-

5,14,45_18685-M

c.

.4








"V-0-an .-

5,14,44_17793-M


5,13,66

17951-N


5,13,65_17959-N

Basically, what I want is the lines that start with the number "5" and nothing else. So I added a check to the code above, but it still didn't work the way I expected.

Here is the revised code:

>>> for i in glob.glob("*.jpg"):
...     text = textract.process(i)
...     if text.startswith('5'):
...         print text

And here is the output from the revised code:

5,13,66

17951-N


5,13,65_17959-N

piglei · Accepted Answer

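textract.process returns all of the text found in an image as a single string, so text.startswith('5') only checks the first character of the whole output, not each line. Try splitting the extracted text into lines first and testing each line: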

>>> for i in glob.glob("*.jpg"):
...     text = textract.process(i)
...     # Split the text into individual lines
...     for line in text.split('\n'):
...         if line.startswith('5'):
...             print line
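
The loop above is written for Python 2. As a rough sketch of the same idea for Python 3, the helper below decodes the output first, since textract.process appears to return bytes there (an assumption worth checking against your installed version). The names lines_starting_with and print_matches are made up for this example, and the sample bytes are taken from the output in the question.

```python
import glob


def lines_starting_with(raw, prefix="5", encoding="utf-8"):
    """Return the stripped lines of `raw` that begin with `prefix`.

    Hypothetical helper, not part of textract.
    """
    # Assumption: on Python 3 the OCR output arrives as bytes, so decode it
    # before comparing against a str prefix. Undecodable bytes are replaced.
    if isinstance(raw, bytes):
        raw = raw.decode(encoding, errors="replace")
    # splitlines() copes with both \n and \r\n line endings.
    stripped = (line.strip() for line in raw.splitlines())
    return [line for line in stripped if line.startswith(prefix)]


def print_matches(pattern="*.jpg"):
    """Print the matching lines for every image that matches `pattern`."""
    import textract  # imported here so the helper above works without it

    for path in glob.glob(pattern):
        for line in lines_starting_with(textract.process(path)):
            print(line)


# Part of the noisy output shown in the question, as bytes.
sample = b"-s.\n\n5,14,45_18685-M\n\nc.\n\n.4\n\n5,13,66\n\n17951-N\n\n5,13,65_17959-N\n"
for line in lines_starting_with(sample):
    print(line)
```

Note that when the OCR breaks one record across two lines, as with 5,13,66 and 17951-N in the question's output, a prefix filter like this keeps only the first half.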
