Slobjo

Reputation: 83

Find dictionaries with same values and add new keys to them

I have a list of dictionaries that contain information about words recognized from an image using the Google API. The list looks like this:

test_list = [
   {
      "value": "68004,",
      "location": {
         "TL": {
            "x": 351,
            "y": 0
         },
         "TR": {
            "x": 402,
            "y": 0
         },
         "BR": {
            "x": 402,
            "y": 12
         },
         "BL": {
            "x": 351,
            "y": 12
         }
      },
      "type": 1
   },
   {
      "value": "Чорномор",
      "location": {
         "TL": {
            "x": 415,
            "y": 0
         },
         "TR": {
            "x": 493,
            "y": 0
         },
         "BR": {
            "x": 493,
            "y": 12
         },
         "BL": {
            "x": 415,
            "y": 12
         }
      },
      "type": 1
   },
   {
      "value": "вулиця,",
      "location": {
         "TL": {
            "x": 495,
            "y": 14
         },
         "TR": {
            "x": 550,
            "y": 10
         },
         "BR": {
            "x": 551,
            "y": 22
         },
         "BL": {
            "x": 496,
            "y": 26
         }
      },
      "type": 1
   },
   {
      "value": "140,",
      "location": {
         "TL": {
            "x": 557,
            "y": 8
         },
         "TR": {
            "x": 576,
            "y": 7
         },
         "BR": {
            "x": 577,
            "y": 20
         },
         "BL": {
            "x": 558,
            "y": 21
         }
      },
      "type": 1
   },
   {
      "value": "кв.",
      "location": {
         "TL": {
            "x": 581,
            "y": 6
         },
         "TR": {
            "x": 605,
            "y": 4
         },
         "BR": {
            "x": 606,
            "y": 21
         },
         "BL": {
            "x": 582,
            "y": 23
         }
      },
      "type": 1
   },
   {
      "value": "77",
      "location": {
         "TL": {
            "x": 607,
            "y": 5
         },
         "TR": {
            "x": 628,
            "y": 4
         },
         "BR": {
            "x": 629,
            "y": 19
         },
         "BL": {
            "x": 608,
            "y": 21
         }
      },
      "type": 1
   },
]

I want to find dictionaries that have the same location parameters and, when they do, add a new key "string_number" with the same index to each of them. For example, in the code above the first two dictionaries have ["location"]["TL"]["y"] and ["location"]["TR"]["y"] equal to 0 px, and ["location"]["BR"]["y"] and ["location"]["BL"]["y"] equal to 12 px. That means these words are on the same line in the real document, so I want to add a new key "string_number" with index 0 to both of them. The result would look like this:

test_list = [
   {
      "value": "68004,",
      "location": {
         "TL": {
            "x": 351,
            "y": 0
         },
         "TR": {
            "x": 402,
            "y": 0
         },
         "BR": {
            "x": 402,
            "y": 12
         },
         "BL": {
            "x": 351,
            "y": 12
         }
      },
      "type": 1,
      "string_number": 0
   },
   {
      "value": "Чорномор",
      "location": {
         "TL": {
            "x": 415,
            "y": 0
         },
         "TR": {
            "x": 493,
            "y": 0
         },
         "BR": {
            "x": 493,
            "y": 12
         },
         "BL": {
            "x": 415,
            "y": 12
         }
      },
      "type": 1,
      "string_number": 0
   },

Then, going through the rest of the list, I want to find every such match and assign the same string index to the matching dictionaries. However, the pixel values can sometimes differ by 1-2 points in either direction (for example, "y": 12, 10, or 14 probably still means the word is on the same line in the document). Is it possible to add an extra check for this difference?
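
To illustrate the kind of check meant here, below is a minimal sketch. The names on_same_line and box are hypothetical and used only for illustration, and the default tolerance of 2 pixels is an assumption taken from the "1-2 points" mentioned above.

```python
def on_same_line(word1, word2, tolerance=2):
    """Return True if all four corner y values differ by at most `tolerance` pixels."""
    corners = ("TL", "TR", "BR", "BL")
    return all(
        abs(word1["location"][corner]["y"] - word2["location"][corner]["y"]) <= tolerance
        for corner in corners
    )


def box(top, bottom):
    """Hypothetical helper: build a word dict with only the y values filled in."""
    return {"location": {"TL": {"y": top}, "TR": {"y": top},
                         "BR": {"y": bottom}, "BL": {"y": bottom}}}


print(on_same_line(box(0, 12), box(1, 14)))   # True: the differences are 1 and 2
print(on_same_line(box(0, 12), box(14, 26)))  # False: the differences are 14
```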

EDIT: With the help of Aleksa Svitlica, I created a class that does all the work of finding words on the same line. It looks like this:

import io
import json


class WordParser():
    def __init__(self):
        self.list_wbw = self.load_json()
        self.next_new_string_number = 0

    def load_json(self):
        # calc_paths and PathType are defined elsewhere in the project (not shown).
        # The with block closes the file, so no explicit close() is needed.
        with io.open(calc_paths(status="now", path_type=PathType.OCR_JSON_WBW), 'r', encoding='utf-8') as content:
            return json.load(content)

    def mark_images_on_same_line(self):
        number_of_images = len(self.list_wbw)
        for i in range(number_of_images):
            for j in range(i + 1, number_of_images):
                image1 = self.list_wbw[i]
                image2 = self.list_wbw[j]
                on_same_line = self._check_if_images_on_same_line(image1, image2)

                if on_same_line:
                    self._add_string_number_to_images(image1, image2)

    def print_images(self):
        print(json.dumps(self.list_wbw, indent=3, sort_keys=False, ensure_ascii=False))

    def _check_if_images_on_same_line(self, image1, image2):
        image1_top_left = image1["location"]["TL"]["y"]
        image1_top_right = image1["location"]["TR"]["y"]
        image1_bot_left = image1["location"]["BL"]["y"]
        image1_bot_right = image1["location"]["BR"]["y"]

        image2_top_left = image2["location"]["TL"]["y"]
        image2_top_right = image2["location"]["TR"]["y"]
        image2_bot_left = image2["location"]["BL"]["y"]
        image2_bot_right = image2["location"]["BR"]["y"]

        same_top_left_position = self._pixel_heights_match_within_threshold(image1_top_left, image2_top_left)
        same_top_right_position = self._pixel_heights_match_within_threshold(image1_top_right, image2_top_right)
        same_bot_left_position = self._pixel_heights_match_within_threshold(image1_bot_left, image2_bot_left)
        same_bot_right_position = self._pixel_heights_match_within_threshold(image1_bot_right, image2_bot_right)

        # Return the result so that mark_images_on_same_line can act on it.
        return (same_top_left_position and same_top_right_position
                and same_bot_left_position and same_bot_right_position)

    def _add_string_number_to_images(self, image1, image2):
        string_number = self._determine_string_number(image1, image2)
        image1["string_number"] = string_number
        image2["string_number"] = string_number

    def _determine_string_number(self, image1, image2):
        string_number = self.next_new_string_number

        image1_number = image1.get("string_number")
        image2_number = image2.get("string_number")

        if image1_number is not None:
            string_number = image1_number
        elif image2_number is not None:
            string_number = image2_number
        else:
            self.next_new_string_number += 1

        return string_number

    def _pixel_heights_match_within_threshold(self, height1, height2, threshold=4):
        return abs(height1 - height2) <= threshold

And in another module, where I call these methods:

    word_parser = WordParser()
    word_parser.mark_images_on_same_line()
    word_parser.print_images()
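
Once the words carry a "string_number", the text of each line can be rebuilt from them. The following is only a sketch under assumptions: lines_from_words is a hypothetical helper (not part of the class above), it orders words by ["location"]["TL"]["x"], and it skips words that never received a "string_number". The sample data is shortened to the fields the helper reads.

```python
from collections import defaultdict


def lines_from_words(words):
    """Map each string_number to its words' values, joined left to right."""
    grouped = defaultdict(list)
    for word in words:
        number = word.get("string_number")
        if number is not None:  # words that matched nothing have no key
            grouped[number].append(word)
    return {
        number: " ".join(
            w["value"]
            for w in sorted(members, key=lambda w: w["location"]["TL"]["x"])
        )
        for number, members in sorted(grouped.items())
    }


sample = [
    {"value": "Чорномор", "location": {"TL": {"x": 415, "y": 0}}, "string_number": 0},
    {"value": "68004,", "location": {"TL": {"x": 351, "y": 0}}, "string_number": 0},
    {"value": "вулиця,", "location": {"TL": {"x": 495, "y": 14}}},
]
print(lines_from_words(sample))  # {0: '68004, Чорномор'}
```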

Upvotes: 1

Views: 63

Answers (1)

Svit

Reputation: 311

After adding the following code below your test_list, I got the output shown further down. The code currently only checks that the y values of TL and BL (the top and bottom of the left edge) of two words are within a threshold of each other (the threshold defaults to 2 pixels). You can modify it to suit your requirements: the rules live in _check_if_images_on_same_line, so change them there as you like.

import json

#-------------------------------------------------------------------
#---Classes---------------------------------------------------------
#-------------------------------------------------------------------
class ImageParser():
    def __init__(self, list_of_images):
        self.list_of_images = list_of_images
        self.next_new_string_number = 0

    # ----------------------------------------------------------------------------
    # ---Public-------------------------------------------------------------------
    # ----------------------------------------------------------------------------

    def mark_images_on_same_line(self):
        number_of_images = len(self.list_of_images)
        for i in range(number_of_images):
            for j in range(i+1, number_of_images):
                image1 = self.list_of_images[i]
                image2 = self.list_of_images[j]
                on_same_line = self._check_if_images_on_same_line(image1, image2)

                if on_same_line:
                    self._add_string_number_to_images(image1, image2)

    def print_images(self):
        print(json.dumps(self.list_of_images, indent=1, sort_keys=False, ensure_ascii=False))

    # ----------------------------------------------------------------------------
    # ---Private------------------------------------------------------------------
    # ----------------------------------------------------------------------------
    def _check_if_images_on_same_line(self, image1, image2):
        image1_top = image1["location"]["TL"]["y"]
        image1_bot = image1["location"]["BL"]["y"]

        image2_top = image2["location"]["TL"]["y"]
        image2_bot = image2["location"]["BL"]["y"]

        same_top_position = self._pixel_heights_match_within_threshold(image1_top, image2_top)
        same_bot_position = self._pixel_heights_match_within_threshold(image1_bot, image2_bot)

        # Return the result so that mark_images_on_same_line can act on it.
        return same_top_position and same_bot_position

    def _add_string_number_to_images(self, image1, image2):
        string_number = self._determine_string_number(image1, image2)
        image1["string_number"] = string_number
        image2["string_number"] = string_number

    def _determine_string_number(self, image1, image2):
        string_number = self.next_new_string_number

        image1_number = image1.get("string_number")
        image2_number = image2.get("string_number")

        if image1_number is not None:
            string_number = image1_number
        elif image2_number is not None:
            string_number = image2_number
        else:
            self.next_new_string_number += 1

        return string_number

    def _pixel_heights_match_within_threshold(self, height1, height2, threshold=2):
        return abs(height1 - height2) <= threshold


#-------------------------------------------------------------------
#---Main------------------------------------------------------------
#-------------------------------------------------------------------
if __name__ == "__main__":
    image_parser = ImageParser(test_list)
    image_parser.mark_images_on_same_line()
    image_parser.print_images()

Gives the following results:

[
 {
  "value": "68004,",
  "location": {
   "TL": {
    "x": 351,
    "y": 0
   },
   "TR": {
    "x": 402,
    "y": 0
   },
   "BR": {
    "x": 402,
    "y": 12
   },
   "BL": {
    "x": 351,
    "y": 12
   }
  },
  "type": 1,
  "string_number": 0
 },
 {
  "value": "Чорномор",
  "location": {
   "TL": {
    "x": 415,
    "y": 0
   },
   "TR": {
    "x": 493,
    "y": 0
   },
   "BR": {
    "x": 493,
    "y": 12
   },
   "BL": {
    "x": 415,
    "y": 12
   }
  },
  "type": 1,
  "string_number": 0
 },
 {
  "value": "вулиця,",
  "location": {
   "TL": {
    "x": 495,
    "y": 14
   },
   "TR": {
    "x": 550,
    "y": 10
   },
   "BR": {
    "x": 551,
    "y": 22
   },
   "BL": {
    "x": 496,
    "y": 26
   }
  },
  "type": 1
 },
 {
  "value": "140,",
  "location": {
   "TL": {
    "x": 557,
    "y": 8
   },
   "TR": {
    "x": 576,
    "y": 7
   },
   "BR": {
    "x": 577,
    "y": 20
   },
   "BL": {
    "x": 558,
    "y": 21
   }
  },
  "type": 1,
  "string_number": 1
 },
 {
  "value": "кв.",
  "location": {
   "TL": {
    "x": 581,
    "y": 6
   },
   "TR": {
    "x": 605,
    "y": 4
   },
   "BR": {
    "x": 606,
    "y": 21
   },
   "BL": {
    "x": 582,
    "y": 23
   }
  },
  "type": 1,
  "string_number": 1
 },
 {
  "value": "77",
  "location": {
   "TL": {
    "x": 607,
    "y": 5
   },
   "TR": {
    "x": 628,
    "y": 4
   },
   "BR": {
    "x": 629,
    "y": 19
   },
   "BL": {
    "x": 608,
    "y": 21
   }
  },
  "type": 1,
  "string_number": 1
 }
]
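
As one example of changing the rule, the comparison could use the vertical center of each bounding box instead of individual corners, which is less sensitive to slightly tilted boxes. This is only a sketch: vertical_center and centers_match are hypothetical names, the rule is an assumption that has not been run against real OCR output, and the sample dictionaries carry only the y values from test_list.

```python
CORNERS = ("TL", "TR", "BR", "BL")


def vertical_center(word):
    """Mean of the four corner y values of a word's bounding box."""
    location = word["location"]
    return sum(location[corner]["y"] for corner in CORNERS) / len(CORNERS)


def centers_match(word1, word2, threshold=2):
    """True if the two vertical centers are within `threshold` pixels."""
    return abs(vertical_center(word1) - vertical_center(word2)) <= threshold


def word_with_ys(tl, tr, br, bl):
    """Hypothetical helper: build a word dict holding only corner y values."""
    ys = dict(zip(CORNERS, (tl, tr, br, bl)))
    return {"location": {corner: {"y": y} for corner, y in ys.items()}}


street = word_with_ys(14, 10, 22, 26)  # "вулиця,", center 18.0
number = word_with_ys(8, 7, 20, 21)    # "140,", center 14.0
flat = word_with_ys(6, 4, 21, 23)      # "кв.", center 13.5

print(centers_match(number, flat))                 # True: centers differ by 0.5
print(centers_match(street, number))               # False: centers differ by 4.0
print(centers_match(street, number, threshold=4))  # True with a looser threshold
```

To try it in the class above, the body of _check_if_images_on_same_line could return the result of such a comparison.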

Upvotes: 1
