upendra

Reputation: 2189

How to clean up this Python output

I am using the Python module textract to extract text from images. Because the images contain a lot of noise, the output contains noise in addition to the text I am actually interested in. Can someone suggest the best way to clean up the output?

Here is my code:

>>> for i in glob.glob("*.jpg"):
...     print(textract.process(i))

Here is my output:

...






-s.

4‘-0-.r-v .-

5,14,45_18685-M

c.

.4








"V-0-an .-

5,14,44_17793-M


5,13,66

17951-N


5,13,65_17959-N

Basically, what I want is the lines that start with the number "5" and nothing else. So I added a check to the code above, but it still didn't work the way I expected.

Here is the revised code:

>>> for i in glob.glob("*.jpg"):
...     text = textract.process(i)
...     if text.startswith('5'):
...             print text

And here is the output from the revised code:

5,13,66

17951-N


5,13,65_17959-N

Upvotes: 0

Views: 794

Answers (2)

matthewatabet

Reputation: 1501

So, taking into account your latest output, I think you should do this:

for i in glob.glob("*.jpg"):
    text = textract.process(i).strip()
    if text.startswith('5'):
        print text

That will remove all leading and trailing whitespace from the output. It looks like there's a lot of trailing whitespace in your case which is causing extra lines to appear between each line.
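
As a small illustration of what strip() does here, the sketch below uses a made-up sample string modelled on the output in the question (it is not real textract output, and the name sample is only for illustration):

```python
# Hypothetical sample modelled on the question's output; not real textract output.
sample = "\n\n5,13,66\n\n17951-N\n\n\n5,13,65_17959-N\n\n"

stripped = sample.strip()

# strip() only removes whitespace at the very start and end of the string,
# so the blank lines between the entries are still there afterwards.
print(repr(stripped))

# With the leading newlines gone, the startswith() check can now succeed.
print(stripped.startswith("5"))
```

Note that this check still applies to the whole text of one image at once, not to each line separately.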

Upvotes: 0

piglei

Reputation: 1208

Maybe you should try splitting the extracted text into lines first:

>>> for i in glob.glob("*.jpg"):
...     text = textract.process(i)
...     # Split the text into separate lines
...     for line in text.split('\n'):
...         if line.startswith('5'):
...             print line
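
The same idea can be wrapped in a small helper, which also makes it easy to try on a plain string without any images. This is only a sketch: the function name lines_starting_with and the sample text are made up for illustration, and the bytes handling assumes the extracted text may arrive as bytes rather than str.

```python
def lines_starting_with(text, prefix="5"):
    """Return the stripped lines of `text` that begin with `prefix`."""
    # Assumption: the extracted text may be bytes; decode it before splitting.
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="replace")
    matches = []
    for line in text.splitlines():
        # Drop surrounding whitespace so indented lines are matched too.
        line = line.strip()
        if line.startswith(prefix):
            matches.append(line)
    return matches


# Hypothetical sample modelled on the output shown in the question.
sample = "-s.\n\n5,14,45_18685-M\n\nc.\n\n.4\n\n5,13,66\n\n17951-N\n"
for line in lines_starting_with(sample):
    print(line)
```

One caveat: when the OCR splits an entry across two lines, as with 5,13,66 and 17951-N in the question, only the half that starts with "5" is kept.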

Upvotes: 1
