Reputation: 1663
I am implementing object localisation in Python. A problem I am having is that when I resize the observable region upon taking an action, I do not know how to change the ground truth box at the same time. Consequently, this occurs:
The ground truth box doesn't resize to fit the plane accurately. Thus, I cannot localize properly. My current function for formatting the next state is as follows:
def next_state(init_input, b, b_prime, g, a):
    """
    Returns the observable region of the next state.

    Formats the next state's observable region, defined
    by b_prime, to be of dimension (224, 224, 3). Adding 16
    additional pixels of context around the original bounding box.
    The ground truth box must be reformatted according to the
    new observable region.

    :param init_input:
        The initial input volume of the current episode.
    :param b:
        The current state's bounding box.
    :param b_prime:
        The subsequent state's bounding box.
    :param g:
        The ground truth box of the target object.
    :param a:
        The action taken by the agent at the current step.
    """
    # Determine the pixel coordinates of the observable region for the following state
    context_pixels = 16
    x1 = max(b_prime[0] - context_pixels, 0)
    y1 = max(b_prime[1] - context_pixels, 0)
    x2 = min(b_prime[2] + context_pixels, IMG_SIZE)
    y2 = min(b_prime[3] + context_pixels, IMG_SIZE)

    # Determine observable region
    observable_region = cv2.resize(init_input[y1:y2, x1:x2], (224, 224))

    # Difference between crop region and image dimensions
    x1_diff = x1
    y1_diff = y1
    x2_diff = IMG_SIZE - x2
    y2_diff = IMG_SIZE - y2

    # Resize ground truth box
    g[0] = int(g[0] - 0.5 * x1_diff)  # x1
    g[1] = int(g[1] - 0.5 * y1_diff)  # y1
    g[2] = int(g[2] + 0.5 * x2_diff)  # x2
    g[3] = int(g[3] + 0.5 * y2_diff)  # y2

    return observable_region, g
I just cannot seem to get the change in dimensions correct. I followed this post in order to initially resize the bounding boxes. Yet that solution doesn't seem to be working in this case.
The bounding box and the ground truth box are both formatted as b = [x1, y1, x2, y2]. init_input is of dimensions (224, 224, 3), IMG_SIZE = 224 and context_pixels = 16.
Here is an additional example:
It seems as if the size of the ground truth box is correct; however, its location is off.
I have updated the code section above. A scale factor seemed to be the wrong way to approach the problem. By just adding/subtracting the number of pixels to be up-scaled, I have gotten a lot closer. I now believe the remaining error has something to do with interpolation, so if anyone could help out with that to make it perfect, that would be a huge help.
New example:
A solution was provided.
Upvotes: 4
Views: 10616
Reputation: 1663
My problem was solved within this post by a user named @lenik.
Before applying the scale factor to the ground truth box g's pixel coordinates, you must first subtract the zero offset so that x1, y1 becomes 0, 0. This allows scaling to work properly.
Thus, the coordinates of any point (x, y) after the transformation can be calculated as:
x_new = (x - x1) * IMG_SIZE / (x2 - x1)
y_new = (y - y1) * IMG_SIZE / (y2 - y1)
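As a quick sanity check of these two formulas, here is a minimal sketch in plain Python. The helper name map_point and the sample crop values are illustrative assumptions, not something from the original post; the crop is chosen to be 112 x 112 so that the scale factor is exactly 2.

```python
IMG_SIZE = 224  # side length of the resized observable region, as in the question


def map_point(x, y, crop, out_size=IMG_SIZE):
    """Map (x, y) from original-image coordinates into the resized crop.

    crop is (x1, y1, x2, y2) in original-image pixels.
    Hypothetical helper, written only to illustrate the formulas.
    """
    x1, y1, x2, y2 = crop
    # Subtract the crop's top-left corner first, then scale.
    x_new = (x - x1) * out_size / (x2 - x1)
    y_new = (y - y1) * out_size / (y2 - y1)
    return x_new, y_new


crop = (50, 40, 162, 152)         # 112 x 112 crop, so the scale factor is 2
print(map_point(50, 40, crop))    # (0.0, 0.0): crop's top-left corner
print(map_point(162, 152, crop))  # (224.0, 224.0): crop's bottom-right corner
print(map_point(106, 96, crop))   # (112.0, 112.0): centre of the crop
```

The crop's own corners land on the corners of the 224 x 224 output, which is the behaviour the subtract-then-scale order is meant to guarantee.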
In code and in relation to my problem, the solution is as follows:
import cv2

IMG_SIZE = 224  # width and height of the input image and of the resized region


def next_state(init_input, b_prime, g):
    """
    Returns the observable region of the next state.

    Formats the next state's observable region, defined
    by b_prime, to be of dimension (224, 224, 3), adding 16
    additional pixels of context around the original bounding box.
    The ground truth box must be reformatted according to the
    new observable region.

    :param init_input:
        The initial input volume of the current episode.
    :param b_prime:
        The subsequent state's bounding box.
    :param g:
        The ground truth box of the target object.
    """
    # Determine the pixel coordinates of the observable region for the following state
    context_pixels = 16
    x1 = max(b_prime[0] - context_pixels, 0)
    y1 = max(b_prime[1] - context_pixels, 0)
    x2 = min(b_prime[2] + context_pixels, IMG_SIZE)
    y2 = min(b_prime[3] + context_pixels, IMG_SIZE)

    # Determine observable region
    observable_region = cv2.resize(init_input[y1:y2, x1:x2], (224, 224), interpolation=cv2.INTER_AREA)

    # Resize ground truth box: subtract the crop offset, then scale
    g[0] = int((g[0] - x1) * IMG_SIZE / (x2 - x1))  # x1
    g[1] = int((g[1] - y1) * IMG_SIZE / (y2 - y1))  # y1
    g[2] = int((g[2] - x1) * IMG_SIZE / (x2 - x1))  # x2
    g[3] = int((g[3] - y1) * IMG_SIZE / (y2 - y1))  # y2

    return observable_region, g
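One thing to watch for: the function above writes the new coordinates back into g. If g is a list or NumPy array that the caller reuses on the next step, the box would be transformed a second time. Below is a hedged sketch of a variant that leaves g untouched and also clamps the result to the visible region. The name rescale_box, the rounding and the clamping are my own assumptions rather than part of @lenik's answer, and it is written in plain Python so it can be checked without OpenCV.

```python
IMG_SIZE = 224  # side length of the resized observable region


def rescale_box(g, crop, out_size=IMG_SIZE):
    """Return g = [x1, y1, x2, y2] expressed in resized-crop coordinates.

    crop is the (x1, y1, x2, y2) region cut from the original image.
    Hypothetical helper: it builds a new list instead of modifying g.
    """
    cx1, cy1, cx2, cy2 = crop
    sx = out_size / (cx2 - cx1)  # horizontal scale factor
    sy = out_size / (cy2 - cy1)  # vertical scale factor

    scaled = [
        (g[0] - cx1) * sx,
        (g[1] - cy1) * sy,
        (g[2] - cx1) * sx,
        (g[3] - cy1) * sy,
    ]
    # Clamp to [0, out_size] in case the box extends beyond the crop.
    return [int(round(min(max(v, 0), out_size))) for v in scaled]


g = [60, 70, 150, 140]
crop = (50, 40, 162, 152)  # 112 x 112 crop, scale factor 2
print(rescale_box(g, crop))                   # [20, 60, 200, 200]
print(g)                                      # [60, 70, 150, 140], unchanged
print(rescale_box([30, 70, 150, 140], crop))  # [0, 60, 200, 200], left edge clamped
```

Whether clamping is desirable depends on how the reward (for example, IoU) is computed, so treat that part as optional.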
Upvotes: 1