DataScienceNovice

Reputation: 502

Extracting numerical information from PNGs

I apologize if this question is inappropriate for this site.

I have several hundred images of graphs; each graph is a PNG. They all look like the following:

[Image: example bar chart showing the percentage of students per grade]

The x-axis is labeled with all possible categories (grades). The y-axis shows the percentage of kids that got a certain grade. All the graphs follow this format; there are no deviations.

Using Python, what would be the most effective way to extract the data from such an image? My goal is to extract the percentage values for each grade category, so I can do some further analysis - I'm trying to see which classes have the highest percentage of A+/A grades, so I can plan for next term.

Of course, what I would really only need is the relative heights of the bars, and I could calculate ratios based on that information. This could be accomplished with Otsu thresholding using something like OpenCV; is there any easier way to do what I want? I'm sure this has been done before; if anyone could point me to a (preferably Python) repo or tutorial, that would be great.

Upvotes: 1

Views: 533

Answers (1)

Bahnschrift

Reputation: 26

Assuming the graphs all have the same dimensions, number of columns, etc., one way to do this is to measure the height of each column in pixels and then compare those heights. To get the height of each column, you can use the Pillow library (imported as PIL).

First, based on the image you uploaded, the bottom of each column is at pixel y = 523 (with the top of the image being y = 0), and the center of the first column is at x = 136. Furthermore, the center of each column is either 45 or 46 pixels to the right of the previous one (this alternates), and there are 15 columns.
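
The alternating 45/46 step is what you would get from a column pitch of 45.5 pixels rounded down, so the column centers can also be computed directly. A small sketch under that assumption (the function name `column_centers` is made up for illustration, and the numbers only apply to images with the same layout as the uploaded one):

```python
def column_centers(first_x=136, num_cols=15):
    """Return the x coordinate of the center of each column.

    Assumes steps alternating 45, 46, 45, ... pixels, which is the same as
    an average pitch of 45.5 px (91 / 2) rounded down.
    """
    return [first_x + (91 * i) // 2 for i in range(num_cols)]


print(column_centers())
```

With the defaults, this gives 136, 181, 227, 272, and so on, ending at x = 773 for the 15th column.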

Based on this, you can use this script to get the height of each column in a graph:

from PIL import Image


def col_heights(filename):
    # Convert to RGBA so the 4-value comparison below works for any PNG mode
    with Image.open(filename) as src:
        img = src.convert("RGBA")
    cols = []
    sy = 523  # The y level of the bottom of each column
    x = 136  # The x position of the center of the first column
    add_45_or_46 = False  # False to increment by 45, True for 46

    num_cols = 15
    for _ in range(num_cols):
        y = sy
        # Work upwards until a white pixel (or the top of the image) is reached
        while y >= 0 and img.getpixel((x, y)) != (255, 255, 255, 255):
            y -= 1
        cols.append(sy - y)

        x += 46 if add_45_or_46 else 45
        add_45_or_46 = not add_45_or_46

    return cols

So what does this do? It first opens the image, then sets the starting values for x (the first column's x position), sy (the starting y level of every column), and whether to add 45 or 46 to get to the next column. Then, for each column, it works upwards from the bottom until it reaches a white pixel (i.e. the background above the bar), then adds the height of that column to the list of column heights.
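
Since you have several hundred images, you will probably want to run this over a whole folder. Here is a rough sketch using only the standard library; `heights_for_folder`, its `measure` parameter, and the CSV layout are my own assumptions, not anything required by Pillow. You would pass `col_heights` as `measure`.

```python
import csv
from pathlib import Path


def heights_for_folder(folder, measure, out_csv):
    """Apply `measure` to every PNG in `folder` and write one CSV row per image.

    `measure` is any function that maps a filename to a list of column
    heights, for example the col_heights function above.
    """
    rows = []
    for path in sorted(Path(folder).glob("*.png")):
        # Each row: file name followed by the height of every column
        rows.append([path.name, *measure(str(path))])
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

For example, `heights_for_folder("graphs", col_heights, "heights.csv")` would process every `.png` file in a (hypothetical) `graphs` directory.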

For example, for the graph you uploaded, the heights of each column are [220, 430, 242, 143, 54, 32, 0, 10, 0, 10, 0, 0, 43, 176, 21].
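
To turn these pixel heights into the relative values you asked about, you can divide each height by the total. This is only a sketch: it assumes the y-axis is linear and starts at zero at the bottom of the columns, and the result only equals the real percentages if the 15 columns together cover every student. The name `relative_shares` is just for illustration, and which indices correspond to A+/A depends on the order of the labels on your x-axis.

```python
def relative_shares(heights):
    """Return each column's height as a fraction of the summed heights."""
    total = sum(heights)
    if total == 0:
        raise ValueError("all columns have zero height")
    return [h / total for h in heights]


# Heights measured from the uploaded graph
heights = [220, 430, 242, 143, 54, 32, 0, 10, 0, 10, 0, 0, 43, 176, 21]
shares = relative_shares(heights)
tallest = max(range(len(heights)), key=heights.__getitem__)
print(f"tallest column: index {tallest}, share {shares[tallest]:.1%}")
```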

Upvotes: 1
