Reputation: 83
I am building an app for optical character recognition using Tesseract, OpenCV and Google Vision. It can recognize 4 types of documents (such as receipts). The user chooses a file (an image), then chooses the exact type of the document and clicks "Recognize". The following process then happens:
That is the common scenario. But I also let users choose a document type called "I don't know the type" ;) The function that runs after choosing this type simply loops over all the document types we have. Because a document type is returned each time the OpenCV + Tesseract step finishes, only one of the results will be something other than NONE; that is the right type, and it is the one we use for recognition. This function therefore has the longest execution time, because it needs up to 4 OpenCV alignments and up to 4 Tesseract recognitions to find the appropriate document type.
I want to speed up my program by implementing multiprocessing for this particular function. I decided to use Celery + Redis for this purpose. The idea is:
These are the functions I mentioned above:
raw_img_path - the path the user chooses in the UI
dt - the document type the user chooses in the UI
calc_receipt - for recognizing when we know the type of document:
def calc_receipt(self, raw_img_path, dt):
    aligned_img_path = OpenCV.align_img(
        template_path="some\\path\\for\\chosen\\type",
        raw_img_path="user\\image\\path",
        result_img_path="aligned\\image\\path",
    )
    tesseract_result = Tesseract.read_from_img(
        img_path="aligned\\image\\path",
    )
    if tesseract_result:
        return aligned_img_path, dt
    return '', DocumentType.NONE
calc_receipts - for recognizing when we don't know the type:
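def calc_receipts(self, raw_img_path, selected_doc_type):
    # Try each candidate type in turn and stop at the first one that matches.
    for dt in map_receipt_to_receipts[selected_doc_type]:
        aligned_img_path, doc_type = self.calc_receipt(raw_img_path, dt)
        if doc_type is not DocumentType.NONE:
            return aligned_img_path, doc_type
    # No candidate matched (or the list was empty): mirror calc_receipt's fallback.
    return '', DocumentType.NONE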
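The fan-out idea described above can be tried with the standard library alone before bringing in a task queue. The following is only a minimal sketch under stated assumptions, not the code from this question: `try_type` is a hypothetical stand-in for `calc_receipt` (it "matches" when the type name appears in the file name), document types are plain strings instead of `DocumentType` members, and a thread pool is used for brevity. `ProcessPoolExecutor` offers the same `submit` interface if the work turns out to be CPU-bound in Python.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical candidate list; stands in for the DONT_KNOW entry of
# map_receipt_to_receipts.
CANDIDATE_TYPES = ["RESTAURANT", "CAFE", "BAR", "COFFEE_SHOP"]


def try_type(raw_img_path, doc_type):
    """Stand-in for calc_receipt: returns (aligned_path, doc_type) or ('', None).

    Assumption for the demo only: a type "matches" when its name appears in
    the file name. The real function would run OpenCV alignment and Tesseract.
    """
    if doc_type.lower() in raw_img_path.lower():
        return f"aligned_{doc_type.lower()}.png", doc_type
    return "", None


def detect_type(raw_img_path, candidates=CANDIDATE_TYPES):
    """Try every candidate type concurrently and return the first match."""
    with ThreadPoolExecutor(max_workers=max(1, len(candidates))) as pool:
        futures = [pool.submit(try_type, raw_img_path, dt) for dt in candidates]
        # as_completed yields futures in the order they finish, not the
        # order they were submitted.
        for future in as_completed(futures):
            aligned_path, doc_type = future.result()
            if doc_type is not None:
                # Leaving the "with" block still waits for the other attempts.
                return aligned_path, doc_type
    return "", None
```

With this stand-in, `detect_type("scan_bar_01.png")` returns the BAR result and a file name containing no type name falls through to `("", None)`. The total time is roughly that of the slowest attempt rather than the sum of all attempts.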
list of available types:
map_receipt_to_receipts = {
DocumentType.NONE: [],
DocumentType.DONT_KNOW: [
DocumentType.RESTAURANT,
DocumentType.CAFE,
DocumentType.BAR,
DocumentType.COFFEE_SHOP,
],
DocumentType.RESTAURANT: [
DocumentType.RESTAURANT,
],
DocumentType.CAFE: [
DocumentType.CAFE,
],
DocumentType.BAR: [
DocumentType.BAR,
],
DocumentType.COFFEE_SHOP: [
DocumentType.COFFEE_SHOP,
],
}
class DocumentType is made for convenience:
class DocumentType(EnumBase):
NONE = 0
DONT_KNOW = 1
RESTAURANT = 2
CAFE = 3
BAR = 4
COFFEE_SHOP = 5
As far as I understand, I need to rewrite the calc_receipts() function. I have some questions:
Upvotes: 0
Views: 458
Reputation: 19787
A1: A Celery worker is started as an independent process. Yes, you can start it programmatically, but that is quite rare (I have used Celery for nearly 6 years and never needed to). I think once you get familiar with Celery you will find out which option is best for you. Until you find that you really need to start it this way, I suggest running it as an independent process. In production you will probably want it as a systemd service.
A2: There is no such thing as a "Celery client", to be honest... Celery is built on top of the messaging capabilities provided by the supported brokers. Think of it in terms of a distributed producer/consumer (or publish/subscribe) pattern. If by "client" you mean "producer", that is, something that sends a message (task) to a particular queue (or to the default queue, which is most common), then yes, you can do that programmatically, and in fact that is how most of us do it.
A3: You do not need to care about Redis. That is entirely Celery's job. If I understand you correctly, you need to construct a simple workflow, probably using a chord primitive (since you need to wait until all tasks are done). For that you need to get familiar with Celery workflows.
A4: I use Celery with Redis (actually, AWS ElastiCache) as both broker and result backend. I believe Redis is probably the most commonly used broker for Celery, probably because it is extremely simple to set up and use. The Celery documentation may seem unclear, but it contains lots of information. There are also thousands of blog articles describing how people use Celery in all sorts of situations. As usual, start with something simple: start a few workers and try to execute a few tiny tasks in a distributed fashion. I learned Celery by following the First steps with Celery document.
Upvotes: 1