user3908734

Reputation: 29

MATLAB beginner: median, mode and binning

I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?

Consider the data given below:

x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
      34, 47, 455, 21, , 22, 100 ];

Once the data is loaded, see if you can find any:

  1. Outliers or
  2. Missing data in the data file

Fill in the missing values using the median and the mode, and smooth the noisy data using median binning, mean binning and bin boundaries.

Upvotes: 1

Views: 618

Answers (1)

rayryeng

Reputation: 104515

This isn't so bad. First off, take a look at the distribution of your data: most of the values have two digits. The outliers are the single-digit values and the values that are far larger than two digits. Mind you, this is totally subjective, so someone else may tell you that the single digits are part of your data too.

The missing data are the empty slots between consecutive commas. If you copy and paste this vector directly into MATLAB, you get a syntax error, because when you define an array explicitly like this every element has to be present. So let's write some MATLAB code that changes the empty slots to NaN (not-a-number).

To do this, use regexprep so that wherever the text contains a comma, a space, then another comma, a NaN is placed in between. For that to work, the statement first has to be stored as a string. We then use eval to convert the modified string into an actual MATLAB array:

x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));

If we display this data, we get:

y =

Columns 1 through 6

       1          48          81           2          10          25

Columns 7 through 12

     NaN          14          18          53          41          56

Columns 13 through 18

      89           0        1000         NaN          34          47

Columns 19 through 23

     455          21         NaN          22         100
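If you would rather not use eval, here is a minimal sketch of the same parse. It assumes a release that ships strsplit, and the names fields, tokens and yNoEval are my own, not part of the code above. It relies on str2double returning NaN for blank text:

```matlab
% Raw text exactly as given, with gaps where values are missing
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];';

% Strip the brackets and the trailing semicolon, leaving comma-separated fields
fields = regexprep(x, '[\[\];]', '');

% Split on every comma; keep empty fields so each value stays in its position
tokens = strsplit(fields, ',', 'CollapseDelimiters', false);

% str2double returns NaN for blank or non-numeric text
yNoEval = str2double(strtrim(tokens));
```

This should give the same 23-element vector, with NaN in positions 7, 16 and 21.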

So, to answer the first question: the missing values are the ones shown as NaN, and the numbers larger than two digits are outliers (as are the single digits, if you choose to count them).
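If you want something less subjective than eyeballing the digits, one option is a median/MAD rule. This is only a sketch of one possible rule, not something the assignment prescribes; the cut-off k = 3 and the variable names are assumptions of mine:

```matlab
% The parsed data, with NaN marking the missing entries
y = [1 48 81 2 10 25 NaN 14 18 53 41 56 89 0 1000 NaN 34 47 455 21 NaN 22 100];

v = y(~isnan(y));                            % non-missing values only
med = median(v);                             % robust centre
scaledMAD = 1.4826 * median(abs(v - med));   % robust spread (scaled median absolute deviation)

k = 3;                                       % hypothetical cut-off; adjust to taste
isOutlier = abs(y - med) > k * scaledMAD;    % comparisons with NaN are false, so gaps are never flagged

outliers = y(isOutlier);
```

On this data the rule flags only 1000 and 455. The single digits are left alone, which fits the point that what counts as an outlier depends on the rule you pick.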


For the next question, we extract the values that are not missing, calculate their mean and median, and fill in the NaN entries with each of those in turn. For the bin boundaries, I take this to mean filling each missing value with the value to its left (or right... depends on your definition, but let's use left). As such:

yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median

%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
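%// Bin boundaries - copy the value just to the left of each missing entry
%// (assumes the first entry is not missing and no two missing entries are adjacent)
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);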
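The one-line left-neighbour fill works for this data, but it would break if the first element were missing or if two missing values sat next to each other. A minimal sketch of a more forgiving version (yFill is my own name) walks the vector once:

```matlab
y = [1 48 81 2 10 25 NaN 14 18 53 41 56 89 0 1000 NaN 34 47 455 21 NaN 22 100];

yFill = y;
for k = 2:numel(yFill)
    % Copy the previous (possibly already filled) value into each gap
    if isnan(yFill(k))
        yFill(k) = yFill(k-1);
    end
end
% A missing first element has no left neighbour, so it would stay NaN
```

Newer releases also have fillmissing(y, 'previous'), which I believe does the same job if it is available to you.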

The mean and median of the non-missing values are:

meanY =

105.8500

medianY =

37.5000
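The question also mentions the mode. Following the same pattern, a sketch would look like this (modeY and yMode are names I made up to match the ones above):

```matlab
y = [1 48 81 2 10 25 NaN 14 18 53 41 56 89 0 1000 NaN 34 47 455 21 NaN 22 100];

yMissing = isnan(y);            % which values are missing?
modeY = mode(y(~yMissing));     % ties resolve to the smallest of the most frequent values

yMode = y;
yMode(yMissing) = modeY;        % fill the gaps with the mode
```

Be aware that every non-missing value here occurs exactly once, so mode simply returns the smallest value, 0. The mode is therefore not a very informative fill for this particular data set.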

The original data with its missing values, followed by the mean-filled, median-filled and left-neighbour-filled versions, look like this:

format bank; %// Do this to show just the first two decimal places for compact output
format compact;

y =
Columns 1 through 5
         1          48          81           2          10
Columns 6 through 10
        25         NaN          14          18          53
Columns 11 through 15
        41          56          89           0        1000
Columns 16 through 20
       NaN          34          47         455          21
Columns 21 through 23
       NaN          22         100

yMean =
Columns 1 through 5
        1.00         48.00         81.00          2.00         10.00
Columns 6 through 10
       25.00        105.85         14.00         18.00         53.00
Columns 11 through 15
       41.00         56.00         89.00             0       1000.00
Columns 16 through 20
      105.85         34.00         47.00        455.00         21.00
Columns 21 through 23
      105.85         22.00        100.00

yMedian =
Columns 1 through 5
        1.00         48.00         81.00          2.00         10.00
Columns 6 through 10
       25.00         37.50         14.00         18.00         53.00
Columns 11 through 15
       41.00         56.00         89.00             0       1000.00
Columns 16 through 20
       37.50         34.00         47.00        455.00         21.00
Columns 21 through 23
       37.50         22.00        100.00

yBinBound =
Columns 1 through 5
        1.00         48.00         81.00          2.00         10.00
Columns 6 through 10
       25.00         25.00         14.00         18.00         53.00
Columns 11 through 15
       41.00         56.00         89.00             0       1000.00
Columns 16 through 20
     1000.00         34.00         47.00        455.00         21.00
Columns 21 through 23
       21.00         22.00        100.00

Looking at each output, the missing entries have been filled in with the mean, the median and the bin-boundary (left neighbour) value respectively, as per the question.
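If your course means "binning" in the data-mining sense (sort the values, split them into equal-depth bins, then smooth each bin), the idea can be sketched as follows. This is my reading of the assignment, not something the question confirms, and the bin depth of 5 is an arbitrary choice that happens to divide the 20 non-missing values evenly:

```matlab
y = [1 48 81 2 10 25 NaN 14 18 53 41 56 89 0 1000 NaN 34 47 455 21 NaN 22 100];

v = sort(y(~isnan(y)));            % binning works on the sorted, non-missing values
binSize = 5;                       % hypothetical bin depth; 20 values give 4 full bins
bins = reshape(v, binSize, []);    % each column is one bin

% Smoothing by bin means / medians: every value becomes its bin's mean / median
byMean   = repmat(mean(bins, 1),   binSize, 1);
byMedian = repmat(median(bins, 1), binSize, 1);

% Smoothing by bin boundaries: every value snaps to the nearer of its bin's min and max
lo = repmat(bins(1, :),   binSize, 1);
hi = repmat(bins(end, :), binSize, 1);
byBound = lo;
useHi = (hi - bins) < (bins - lo); % ties go to the lower boundary
byBound(useHi) = hi(useHi);
```

Each column of byMean, byMedian and byBound is the smoothed version of the matching column of bins. Note that this works on the sorted values, so you would still need to map the results back to the original positions if the order matters.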

Upvotes: 2
