Reputation: 43
I am working on a Python project that uses OCR (tesseract-ocr) to extract text from an image and store it in a file. I then have to read the file character by character and perform some functions for the detected characters. The problem is that the file created by the conversion sometimes has a lot of blank spaces (even blank lines) at the beginning. I do not need to run any function for the spaces, so I want to ignore them all at once to save time. I am running the code on a Raspberry Pi, which has very little memory, and it takes some time to compare each character and skip them one by one.
camera.capture('test.png')
camera.resolution = (1920, 1080)
camera.brightness = 60
# tesseract appends ".txt" to the output base name, producing totext.txt
call(["tesseract","/home/pi/Desktop/fyp_try/test.png","/home/pi/Desktop/fyp_try/totext"])
f = open('/home/pi/Desktop/fyp_try/totext.txt','r')
message = f.read()
print(message)
for i in message:
    print(i)
    if(i>='a')and(i<='z'):
        lst=a[i]
        lstoperate()
    elif(i>='A')and(i<='Z'):
        # uppercase: output the 'dot' entry first, then the lowercase letter
        lst=a['dot']
        lstoperate()
        time.sleep(2)
        smol=i.lower()
        lst=a[smol]
        lstoperate()
    elif (i>='0')and(i<='9'):
        lst=a['numsign']
        lstoperate()
        print(ord(i))
..............
The operation on each character is followed by a sleep of 2-3 seconds, and this also happens when spaces are encountered. Is there any way I can skip all the spaces at once, up to the first non-space character, while reading the file?
Upvotes: 2
Views: 261
Reputation: 155526
If you want to strip all the whitespace in a single operation with low resource costs, you'll want to avoid split/join (which works, but has a high temporary memory cost).
There are two obvious approaches, the lazy filtering approach:
from itertools import filterfalse
...
for i in filterfalse(str.isspace, message):
    ...
which never makes a new str, but simply filters out the characters you don't care about as you go.
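As a minimal, self-contained sketch of the lazy approach: the sample `message` string and the `kept` list below are made up for illustration and stand in for the OCR output and the per-character handling.

```python
from itertools import filterfalse

# Hypothetical stand-in for the text read from the Tesseract output file.
message = "\n\n   Hi 4\n"

# filterfalse yields only the characters for which str.isspace is False,
# so no whitespace-free copy of the string is ever built.
kept = []
for ch in filterfalse(str.isspace, message):
    kept.append(ch)  # replace with the real per-character handling

print(kept)
```

Because the filtering happens inside the iterator, the loop body only ever sees non-whitespace characters, so the per-character sleep is never triggered for spaces or blank lines.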
Or to strip them all up front (doubling initial memory consumption, but then dropping to just what the stripped version requires), use str.translate:
from string import whitespace
dropspaces = str.maketrans('', '', whitespace)
...
message = f.read().translate(dropspaces)
That will strip all ASCII whitespace as if doing .replace(' ', '').replace('\n', '').replace('\r', '') and so on for every whitespace character, but in a single pass, producing a single output string with all the whitespace stripped at once.
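A runnable sketch of the same idea, assuming a made-up sample string in place of the file contents (the `stripped` name is only for illustration):

```python
from string import whitespace

# Translation table that maps nothing and deletes every ASCII
# whitespace character (space, tab, newline, carriage return, ...).
dropspaces = str.maketrans('', '', whitespace)

# Hypothetical stand-in for f.read() on the Tesseract output file.
message = "\n\n   Hi 4\r\n"

stripped = message.translate(dropspaces)
print(stripped)
```

The table only needs to be built once, so it can be created at the top of the script and reused for every image that gets converted.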
Upvotes: 2
Reputation: 43
This can be done using the various strip and join functions, as mentioned by John Szakmeister. You can also refer to this link.
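A short sketch of what the strip and join route might look like, assuming a made-up sample string; which variant fits depends on whether only the leading whitespace or all whitespace should go:

```python
# Hypothetical stand-in for the text read from the Tesseract output file.
message = "\n\n   Hi 4\n"

# lstrip() drops only the leading whitespace, including blank lines,
# and leaves spaces inside the text untouched.
leading_removed = message.lstrip()

# split() with no arguments splits on runs of whitespace; joining the
# pieces with an empty string removes all whitespace. This builds a
# temporary list of substrings, so it uses more memory.
all_removed = "".join(message.split())

print(repr(leading_removed))
print(repr(all_removed))
```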
Upvotes: 0