Reputation:
I have three columns. Column A consists of numbers, column B consists of bin ranges, and column C consists of number data relevant to the individual data in column A.
Using columns A and B, I created a frequency histogram where all the data in column A have been grouped into the bins of column B. I would like to calculate the average value of each bin using the data from column C (i.e., calculate a mean value for each bin using data from column C that is associated to each value (from column A) that made up each bin).
Can anybody help?
Thanks for the replies. Here is an example of the data (Unfortunately I can not paste in images):
Below are three columns with headers Jar Type (in volume (ml)), Cookies (he number of chocolate chip cookies in the jar), and Interval for bins (bins to count the jar types):
Making a histogram of the frequency of jar types gives this grouping:
Now what I am trying to do is to find out what is the mean number of cookies that can be found in each type of jar. For example, for the 500ml we know that there are 4x500ml jars, and that in each of the 500ml we have 3+1+0.5+1 = 5.5 cookies in total. the mean would be 1.735 cookies.
My issue is that I have 5000+ numbers that separate into 100 bins.
Upvotes: 2
Views: 5595
Reputation: 2494
The question calls for a "wandering trace" of a scatterplot: the values of column A (plot them on the horizontal axis) are placed into bins, which therefore comprise vertical strips in the scatterplot. The values of column C (plotted on the vertical axis) are averaged within each strip. This technique smooths out and summarizes apparent trends in the scatterplot.
In this example with 100 records the original data are in black and computed values are in green. Here is the wandering trace of means:
The open circles plot column C
(associated values) against column A
(data) while the solid squares, connected with a dashed red trace, plot the bin means (column G
) against the midpoints (column F
).
Any statistical package will provide functions for grouping data and performing operations on those groups. Excel does this to a limited extent with its SUMIF
and COUNTIF
functions. To use them, create a column (D
in the spreadsheet) showing the grouping factor. (That's a simple lookup in the sorted BINS
vector using the VLOOKUP
function with its "range" option set to true.) SUMIF
computes sums by group factor and COUNTIF
counts by group factor. Their ratios are the bin means.
Here is what the formulas look like:
Only three formulas were actually entered and then copied down as needed:
=VLOOKUP(A2, Bins, 1, TRUE)
computes the group for the value in cell A2
. Bins
a name for the array $(-2,-3, \ldots, 3)$ in column B
.
=AVERAGE(B3:B4)
computes the midpoint of the first bin. This was used as a horizontal plotting position in the scatterplot.
=SUMIF(Bin,"="&B3,NewValues)/COUNTIF(Bin, "="&B3)
is where all the work is done. Bin
refers to the group codes in column D and NewValues
refers to the associated values in column C. The tricky parts are the constructs "="&B3
: these form a text value instructing the data to be grouped by comparison to the number in cell B3
, which is the first endpoint. Because this is a formula, copying it down automatically updates the B3
to B4
, then B5
, etc.
Upvotes: 3