toothsie
toothsie

Reputation: 255

Bin data by x and y columns, and output the mean of a third column

I have three columns like so:

x     y     Value
0.5   0.5     3
2.3   1.2     5
2.7   1.6     10
3.3   4.1     4
3.5   4.2     6
3.8   4.6     8

I want to bin columns x and y and find the mean of column Value. ie. the average would be found between points (3,3) and (4,4), which would equal to 4+6+8/3 = 6. So the output should be like so:

x_bin     y_bin      mean_value
 0, 1      0, 1        3
 0, 1      1, 2        0
 0, 1      2, 3        0
 0, 1      3, 4        0
 1, 2      0, 1        0
 1, 2      1, 2        0
 1, 2      2, 3        0
 1, 2      3, 4        0
 2, 3      0, 1        0  
 2, 3      1, 2        7.5
 2, 3      2, 3        0
 2, 3      3, 4        0
 3, 4      0, 1        0 
 3, 4      1, 2        0
 3, 4      2, 3        0
 3, 4      3, 4        6

Ideally, I would like the output in a format where I could plot this as a heatmap grid.

Thanks in advance.

Upvotes: 0

Views: 123

Answers (1)

JohanC
JohanC

Reputation: 80329

np.histogram2d is numpy's function to bin 2D data. By default, the result counts the points into each bin. With the weights= parameter, the points are weighted, and those weights are summed. Dividing the summed weights by the counts gives the means.

Seaborn's sns.heatmap can display these means and automatically annotate the cell values.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

x = [0.5, 2.3, 2.7, 3.3, 3.5, 3.8]
y = [0.5, 1.2, 1.6, 4.1, 4.2, 4.6]
value = [3, 5, 10, 4, 6, 8]
bins = (np.arange(6), np.arange(6))
sums, _, _ = np.histogram2d(x, y, bins=bins, weights=value)
counts, _, _ = np.histogram2d(x, y, bins=bins)
with  np.errstate(divide='ignore', invalid='ignore'):  # divide 0 by 0 results in NaN
    means = sums / counts
sns.set_style('white')
ax = sns.heatmap(means, annot=True, fmt='.1f', cmap='turbo', vmin=0, vmax=10, square=True,
                cbar=True, cbar_kws={'ticks': np.arange(11)})
ax.set_xticks(range(len(bins[0])))
ax.set_xticklabels(bins[0])
ax.set_yticks(range(len(bins[1])))
ax.set_yticklabels(bins[1])
ax.tick_params(labelrotation=0)
ax.grid(axis='both', color='0.3', clip_on=False)
ax.set_axisbelow(False)
plt.tight_layout()
plt.show()

The image shows the plots for:

  • bins = (np.arange(6), np.arange(6)),
  • bins = ([0, 2, 4, 6], [0, 2, 4, 6]) and
  • bins = (np.arange(0, 5.0001, 0.5), np.arange(0, 5.0001, 0.5))

sns.heatmap with means of 2d weights

Upvotes: 1

Related Questions