Slobjo

Reputation: 83

Celery + Redis for multiprocessing

I am building an optical character recognition app using Tesseract, OpenCV and Google Vision. There are 4 document types available for recognition (such as receipts). The user chooses a file (an image), then chooses the exact document type and clicks "Recognize". The following process then runs:

  1. Image alignment with OpenCV: we have an example image for each document type. When the user chooses, for example, type A, OpenCV loads our example image for type A and searches for similar elements. Once it has found enough of them, it aligns the user's image to maximize the similarity. The function returns the path of the aligned image.
  2. Validation with Tesseract: after alignment, Tesseract opens the aligned image and recognizes its text. I have a list of words that must appear in the recognized text, so we check the text against that list and increase a validation counter for each successful match. If the validation counter is greater than some threshold, we approve the result and treat the image as a valid one that Google can recognize in the next step. If the counter is too low, we set the document type to None. The function returns the path of the approved aligned image and the document type the user chose at the beginning.
  3. After this, the final function runs. If the document type is not None (meaning we have an approved image file), we pass the image path to a function that uses the Google Vision API for recognition, and we get a .txt file.
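To illustrate the validation counter from step 2, a minimal sketch might look like the following. The names `count_matches`, `is_valid` and `min_matches` are hypothetical and do not come from the actual project, and plain substring matching is a simplifying assumption (it would also count "tax" inside "taxi").

```python
def count_matches(recognized_text, required_words):
    """Return how many of the required words occur in the recognized text."""
    text = recognized_text.lower()
    # Case-insensitive substring check; a real validator may need word boundaries.
    return sum(1 for word in required_words if word.lower() in text)


def is_valid(recognized_text, required_words, min_matches):
    """Approve the image when the counter is strictly greater than the threshold."""
    return count_matches(recognized_text, required_words) > min_matches
```

For example, `is_valid(text, ["total", "tax", "cashier"], 1)` approves a text that contains at least two of the three words.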

That is the common scenario. But I also let users choose a document type called "I don't know the type" ;) The function that runs for this type simply loops over all the document types we have. Since we return the document type every time OpenCV + Tesseract finish, we will get only one type that is not None; that is the right one, and we use it for recognition. This function has the longest execution time, because it needs up to 4 OpenCV alignments and up to 4 Tesseract recognitions to find the appropriate document type. I want to speed up my program by implementing multiprocessing for this particular function, and I decided to use Celery + Redis for that. The idea is:

  1. Turn this loop function into a Celery task.
  2. Wait for all operations to finish.
  3. Read the responses from Redis (there must be 4 responses: 3 of them None and 1 the real document type).
  4. Clean up Redis.
  5. Use the right type in the final Google function.

These are the functions I mentioned above:

raw_img_path - the path the user chooses in the UI
dt - the document type the user chooses in the UI

calc_receipt - for recognizing when we know the type of document:

def calc_receipt(self, raw_img_path, dt):
    aligned_img_path = OpenCV.align_img(
        template_path="some\\path\\for\\chosen\\type",
        raw_img_path="user\\image\\path",
        result_img_path="aligned\\image\\path",
    )

    tesseract_result = Tesseract.read_from_img(
        img_path="aligned\\image\\path",
    )

    if tesseract_result:
        return aligned_img_path, dt

    return '', DocumentType.NONE

calc_receipts - for recognizing when we don't know the type:

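def calc_receipts(self, raw_img_path, selected_doc_type):
    # Try each candidate type in turn and stop at the first one that validates.
    for dt in map_receipt_to_receipts[selected_doc_type]:
        aligned_img_path, doc_type = self.calc_receipt(raw_img_path, dt)
        if doc_type is not DocumentType.NONE:
            return aligned_img_path, doc_type

    # No candidate matched: return the same "nothing found" pair as calc_receipt.
    return '', DocumentType.NONE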

list of available types:

map_receipt_to_receipts = {
    DocumentType.NONE: [],
    DocumentType.DONT_KNOW: [
        DocumentType.RESTAURANT,
        DocumentType.CAFE,
        DocumentType.BAR,
        DocumentType.COFFEE_SHOP,
    ],
    DocumentType.RESTAURANT: [
        DocumentType.RESTAURANT,
    ],
    DocumentType.CAFE: [
        DocumentType.CAFE,
    ],
    DocumentType.BAR: [
        DocumentType.BAR,
    ],
    DocumentType.COFFEE_SHOP: [
        DocumentType.COFFEE_SHOP,
    ],
}

class DocumentType is made for convenience:

class DocumentType(EnumBase):
    NONE = 0
    DONT_KNOW = 1
    RESTAURANT = 2
    CAFE = 3
    BAR = 4
    COFFEE_SHOP = 5

As far as I understand, I need to rebuild the calc_receipts() function. I have some questions:

  1. Should I start the Celery worker from this function?
  2. Can I start a Celery client from Python code instead of from the console?
  3. How do I manage the Redis responses and wait for all 4 operations?
  4. What would you generally recommend for this use of Celery + Redis?

I'm sorry if my questions are rather basic, but it is incredibly hard for me to find the answers on the Web. The documentation isn't clear or beginner-friendly enough.

Upvotes: 0

Views: 458

Answers (1)

DejanLekic

Reputation: 19787

A1: A Celery worker runs as an independent process. Yes, you can start it programmatically, but that is quite rare (I have used Celery for nearly 6 years and never needed it). Once you get familiar with Celery you will find the option that suits you best. Until you find that you really need to start it this way, I suggest running it as an independent process. In production you will probably want to run it as a systemd service.

A2: There is no such thing as a "celery client", to be honest. Celery is built on top of the messaging capabilities provided by the supported brokers. Think of it in terms of a distributed producer/consumer (or publish/subscribe) pattern. If by "client" you mean "producer", something that sends a message (task) to a particular queue (or to the default queue, which is most common), then yes, you can do that programmatically, and in fact that is how most of us do it.

A3: You do not need to care about Redis; that is entirely Celery's job. If I understand you correctly, you need to construct a simple workflow, probably using a Chord primitive (since you need to wait until all tasks are done). For that you need to get familiar with Celery workflows.

A4: I use Celery with Redis (actually, AWS ElastiCache) as both broker and result backend. I believe Redis is probably the most commonly used broker for Celery, probably because it is extremely simple to set up and use. The Celery documentation may seem unclear, but it contains lots of information. There are also thousands of blog articles describing how people use Celery in all sorts of situations. As usual, start with something simple: run a few workers and try to execute a few tiny tasks in a distributed fashion. I learned Celery by following the First Steps with Celery document.

Upvotes: 1
