Josh Sharkey

Reputation: 1038

Outlier detection approach with smaller datasets

I have a Python function that takes a list of smaller image boxes (represented as arrays) and the whole image img as parameters, and finds the outliers among the boxes. An outlier is significantly brighter or darker than the other boxes in the list; darker is the more common case.

import numpy as np

def find_outliers(boxes, img):
    # Mean gray value of each box; img is not used in this version.
    means = [np.mean(box['src']) for box in boxes]
    # Tukey fences: 1.5 * IQR below the first and above the third quartile.
    q1, q3 = np.percentile(means, [25, 75])
    iqr = q3 - q1
    lower = q1 - (1.5 * iqr)
    upper = q3 + (1.5 * iqr)

    # print('thresholds:', lower, upper)
    return [box for box, mean in zip(boxes, means) if mean < lower or mean > upper]

This method allows me to create thresholds based on the image, instead of coming up with hard values, which is ideal in my situation. There are 3 problems I need to address if I continue this approach.

  1. Sometimes the brighter/darker images outnumber the normal images. Their extreme values bias my outlier method into treating them as normal.
  2. Sometimes the number of boxes is very small (3 or 4). This makes it hard for this method to find an adequate lower and upper bound.
  3. The lower and upper bounds can be negative, but all of my values will be greater than or equal to 0.
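
The three problems can be reproduced with a few hand-picked box means. This is only a sketch: the helper names iqr_fences and flagged and the sample values are made up for illustration, and the fences are computed the same way as in find_outliers.

```python
import numpy as np

def iqr_fences(means):
    """Return the (lower, upper) fences at 1.5 * IQR beyond the quartiles."""
    q1, q3 = np.percentile(means, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def flagged(means):
    """Return the means that fall outside the fences."""
    lower, upper = iqr_fences(means)
    return [m for m in means if m < lower or m > upper]

# Problem 1: dark boxes (10, 11, 12) outnumber the normal ones (200, 205).
majority_dark = [10, 12, 11, 200, 205]
# Problems 2 and 3: only three boxes, one of them dark.
tiny = [20, 200, 210]

for name, means in [("majority_dark", majority_dark), ("tiny", tiny)]:
    lower, upper = iqr_fences(means)
    print(name, "fences:", lower, upper, "flagged:", flagged(means))
```

In both cases nothing is flagged and the lower fence is negative, because the extreme values themselves stretch the interquartile range.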

Is there a statistical approach that is better suited to this type of problem? Is there a different way to establish threshold values based on the image?

Note: I have also tried the standard deviation approach to outlier detection, but it isn't suitable in this scenario.

Upvotes: 1

Views: 404

Answers (1)

Stef

Reputation: 30679

Rather than finding outliers within the list of boxes, calculate the lower and upper bounds from the pixel values of the whole image. Any box whose average gray value falls outside these bounds is considered an outlier:

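import numpy as np

def find_outliers(boxes, img):
    # Quartiles are taken over every pixel of the whole image, not over the box means.
    q1, q3 = np.percentile(img, [25, 75])
    iqr = q3 - q1
    lower = q1 - (1.5 * iqr)
    upper = q3 + (1.5 * iqr)

    # print('thresholds:', lower, upper)
    # Keep the boxes whose mean gray value lies outside the image-wide bounds.
    return [box for box in boxes if np.mean(box['src']) < lower or np.mean(box['src']) > upper]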
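
A quick way to try this out is a synthetic image with one dark and one bright patch. The sketch below is only an illustration under assumed data: the gradient image, the patch positions, and the 'name' key are made up here (only 'src' comes from the original code), and the function is restated so the snippet runs on its own.

```python
import numpy as np

def find_outliers(boxes, img):
    # Bounds come from the quartiles of all pixels in the whole image.
    q1, q3 = np.percentile(img, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [b for b in boxes if np.mean(b['src']) < lower or np.mean(b['src']) > upper]

# Synthetic 100x100 grayscale image: a horizontal gradient from 100 to 140.
img = np.tile(np.linspace(100.0, 140.0, 100), (100, 1))
img[0:10, 0:10] = 0.0         # one dark patch
img[90:100, 90:100] = 255.0   # one bright patch

# 'name' is a hypothetical key added only to label the boxes in the output.
boxes = [
    {'name': 'dark', 'src': img[0:10, 0:10]},
    {'name': 'normal', 'src': img[50:60, 50:60]},
    {'name': 'bright', 'src': img[90:100, 90:100]},
]

outliers = find_outliers(boxes, img)
print([b['name'] for b in outliers])
```

Here the dark and bright boxes are returned even though there are only three boxes, since the bounds no longer depend on how many boxes there are. Note that this relies on the unusual regions covering only a small share of the image; if they dominate the pixels, the image-wide quartiles shift as well.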

Upvotes: 1
