Resize bounding box according to image

Question

I am implementing object localisation in Python. When I resize the observable region after taking an action, I do not know how to transform the ground truth box to match. Consequently, this occurs:

The ground truth box doesn't resize to fit the plane accurately. Thus, I cannot localize properly. My current function for formatting the next state is as follows:

def next_state(init_input, b, b_prime, g, a):
    """ 
    Returns the observable region of the next state.

    Formats the next state's observable region, defined
    by b_prime, to be of dimension (224, 224, 3), adding 16
    additional pixels of context around the original bounding box.
    The ground truth box must be reformatted according to the
    new observable region.

    :param init_input:
        The initial input volume of the current episode.

    :param b:
        The current state's bounding box.

    :param b_prime:
        The subsequent state's bounding box.

    :param g:
        The ground truth box of the target object.

    :param a:
        The action taken by the agent at the current step.
    """

    # Determine the pixel coordinates of the observable region for the following state
    context_pixels = 16
    x1 = max(b_prime[0] - context_pixels, 0)
    y1 = max(b_prime[1] - context_pixels, 0)
    x2 = min(b_prime[2] + context_pixels, IMG_SIZE)
    y2 = min(b_prime[3] + context_pixels, IMG_SIZE)

    # Determine observable region
    observable_region = cv2.resize(init_input[y1:y2, x1:x2], (224, 224))

    # Difference between crop region and image dimensions
    x1_diff = x1
    y1_diff = y1
    x2_diff = IMG_SIZE - x2
    y2_diff = IMG_SIZE - y2

    # Resize ground truth box
    g[0] = int(g[0] - 0.5 * x1_diff)  # x1
    g[1] = int(g[1] - 0.5 * y1_diff)  # y1
    g[2] = int(g[2] + 0.5 * x2_diff)  # x2
    g[3] = int(g[3] + 0.5 * y2_diff)  # y2

    return observable_region, g

I just cannot seem to get the change in dimensions correct. I initially followed this post to resize the bounding boxes, but that solution doesn't seem to work in this case.

Bounding box/ground truth box are formatted as: b = [x1, y1, x2, y2]

init_input is of dimensions (224, 224, 3). IMG_SIZE = 224 and context_pixels = 16

Here is an additional example:

It seems as if the size of the ground truth box is correct, but the location is off.

Update

I have updated the code section above. A scale factor seemed to be the wrong way to approach the problem. By just adding/subtracting the number of pixels to be up-scaled, I have gotten a lot closer. I now suspect the remaining error has something to do with interpolation, so if anyone could help me get it exactly right, that would be a huge help.

New example:

Update 2

A solution was provided.

Wizard · Accepted Answer

My problem was solved within this post by a user named @lenik.

Before applying the scale factor to the pixel coordinates of the ground truth box g, you must first subtract the crop's offset so that the crop's top-left corner (x1, y1) maps to (0, 0). Only then does the scaling work properly.

Thus, the coordinates of any point (x, y) after the transformation can be calculated as:

x_new = (x - x1) * IMG_SIZE / (x2 - x1)
y_new = (y - y1) * IMG_SIZE / (y2 - y1)
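
To make the formula concrete, here is a minimal sketch that applies it to a few points. The helper name crop_to_resized and the crop values are hypothetical and chosen only for illustration; the output size of 224 is the IMG_SIZE stated in the question.

```python
def crop_to_resized(x, y, crop, out_size=224):
    """Map a point from original-image coordinates into the resized crop.

    crop is (x1, y1, x2, y2) in original-image pixels (hypothetical helper).
    """
    x1, y1, x2, y2 = crop
    # Subtract the crop offset first, then scale to the output size.
    x_new = (x - x1) * out_size / (x2 - x1)
    y_new = (y - y1) * out_size / (y2 - y1)
    return x_new, y_new


crop = (50, 40, 162, 152)  # an illustrative 112 x 112 crop

print(crop_to_resized(50, 40, crop))    # top-left corner  -> (0.0, 0.0)
print(crop_to_resized(162, 152, crop))  # bottom-right     -> (224.0, 224.0)
print(crop_to_resized(106, 96, crop))   # centre of crop   -> (112.0, 112.0)
```

The corners of the crop land on the corners of the resized image and its centre lands on the centre, which is the behaviour the offset subtraction is meant to guarantee.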

In code and in relation to my problem, the solution is as follows:

import cv2

IMG_SIZE = 224  # side length of the square input image, as stated in the question


def next_state(init_input, b_prime, g):
    """
    Returns the observable region of the next state.

    Formats the next state's observable region, defined
    by b_prime, to be of dimension (224, 224, 3), adding 16
    additional pixels of context around the original bounding box.
    The ground truth box must be reformatted according to the
    new observable region.

    :param init_input:
        The initial input volume of the current episode.

    :param b_prime:
        The subsequent state's bounding box.

    :param g:
        The ground truth box of the target object.
    """

    # Determine the pixel coordinates of the observable region for the following state
    context_pixels = 16
    x1 = max(b_prime[0] - context_pixels, 0)
    y1 = max(b_prime[1] - context_pixels, 0)
    x2 = min(b_prime[2] + context_pixels, IMG_SIZE)
    y2 = min(b_prime[3] + context_pixels, IMG_SIZE)

    # Determine observable region
    observable_region = cv2.resize(init_input[y1:y2, x1:x2], (224, 224), interpolation=cv2.INTER_AREA)

    # Resize ground truth box 
    g[0] = int((g[0] - x1) * IMG_SIZE / (x2 - x1))  # x1
    g[1] = int((g[1] - y1) * IMG_SIZE / (y2 - y1))  # y1
    g[2] = int((g[2] - x1) * IMG_SIZE / (x2 - x1))  # x2
    g[3] = int((g[3] - y1) * IMG_SIZE / (y2 - y1))  # y2

    return observable_region, g
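
Note that the function above overwrites g in place. If the same list is reused across steps while b_prime is still expressed in the original image's coordinates, the box would be transformed repeatedly. The following is a sketch of one possible variant, not part of the original answer: the helper name transform_box and the clip option are assumptions of mine. It returns a new list and optionally clips the box to the observable region, since part of the ground truth may fall outside the crop.

```python
IMG_SIZE = 224  # value stated in the question


def transform_box(g, crop, out_size=IMG_SIZE, clip=True):
    """Return a new box mapped from original-image to resized-crop coordinates.

    g and crop are [x1, y1, x2, y2]; g itself is left unmodified.
    (Hypothetical helper, same arithmetic as the answer's next_state.)
    """
    x1, y1, x2, y2 = crop
    new_g = [
        int((g[0] - x1) * out_size / (x2 - x1)),
        int((g[1] - y1) * out_size / (y2 - y1)),
        int((g[2] - x1) * out_size / (x2 - x1)),
        int((g[3] - y1) * out_size / (y2 - y1)),
    ]
    if clip:
        # Keep the box inside the resized observable region.
        new_g = [min(max(v, 0), out_size) for v in new_g]
    return new_g


g = [60, 70, 150, 140]      # illustrative ground truth box
crop = (34, 44, 146, 156)   # illustrative 112 x 112 observable region

print(transform_box(g, crop))              # [52, 52, 224, 192]
print(transform_box(g, crop, clip=False))  # [52, 52, 232, 192]
print(g)                                   # [60, 70, 150, 140], unchanged
```

Whether clipping is desirable depends on how the box is used afterwards (for example, an IoU computed against the clipped box differs from one computed against the unclipped box), so it is left as an option here.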
