CHP

Reputation: 950

Matlab - Remove bad data from vector of values

I have a vector, stdclock, which holds values that follow this pattern:

stdclock=[13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 2517 2529 2542 2554 2567 2579 2592 2604 2617 2629 2642 2654 2667 2679 2692 2704 2717]

This data is generated through an encoding of 17 values that come 12 or 13 numbers apart (e.g. 25-13 = 12, 38-25 = 13, etc.). You'll see that the first 17 values follow this pattern. Each group of 17 values encodes an object, which we'll call an 'item', and is independent of the subsequent 17 values. Then, between value 17 and 18, there's a much larger difference than 12 or 13, but it could be any number higher than, say, 15. This difference represents a qualitative separation in the data such that the first 17 values encode one item, the next 17 values encode another item, and so on. The difference between the 17th and 18th value will never be as small as 12 or 13. Therefore, I can check for any differences >= 15, and be sure that I can separate my data in this way. Alternatively, I can reshape the vector as a 17 x length(stdclock)/17 matrix.
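For clean data, the two ways of splitting described above could look roughly like this. This is only a sketch: the names `gapThreshold`, `itemStart`, `itemIndex`, and `items` are made up for illustration, and the cut-off of 15 is the value guessed at in the text.

```matlab
stdclock = [13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 ...
            2517 2529 2542 2554 2567 2579 2592 2604 2617 2629 2642 2654 2667 2679 2692 2704 2717];

% Option 1: a gap of at least gapThreshold marks the first value of a new item
gapThreshold = 15;                                % assumed cut-off, taken from the text
itemStart = [1, diff(stdclock) >= gapThreshold];  % 1 at the first value of each item
itemIndex = cumsum(itemStart);                    % item number for every value

% Option 2: with no dropped values, each column of the reshaped matrix is one item
items = reshape(stdclock, 17, []);
```

Option 2 only works while every item really has 17 values, which is exactly what breaks once the hardware drops readings.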

So far so good. The problem is that this vector is generated through hardware which can sometimes have errors such that one or more values are simply dropped and not recorded. I want to figure out an algorithm that will detect that values are missing from an 'item' and then remove all remaining values from that item.

I can't quite wrap my head around how to do this in a way that will work for all patterns of errors (e.g. if an item can have missing numbers anywhere, in any pattern, and neighboring items may also have missing numbers anywhere in any pattern, or nowhere).

Any help would be appreciated. An example of a 'corrupted' item would be like this

stdclock=[13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 2529 2542 2554 2567 2579 2592 2604 2642 2654 2679 2692 2704]

where this stdclock is the same as the one on top, but I went through in the second item and randomly removed numbers, including the first and last numbers.

Upvotes: 1

Views: 2153

Answers (1)

Jonas

Reputation: 74940

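If you can assume that the difference between consecutive groups is always larger than some threshold, you can use the approach below: identify consecutive groups, and throw out all groups of a length less than 17. It turns out that the threshold for a new group can be set as low as 15: a data point missing from the middle of an item leaves a gap of at least 24 (two steps of 12 or 13), which splits the group of 17 into two shorter groups that are then both removed. A data point missing from the start or end of an item simply leaves one group that is too short, and that group is removed as well.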

stdclock=[13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 2529 2542 2554 2567 2579 2592 2604 2642 2654 2679 2692 2704];

%# a difference of more than groupDelta indicates a new (pseudo-)group
groupDelta = 15; 
groupJump = [1 diff(stdclock) > groupDelta];

%# number the groups
groupNumber = cumsum(groupJump);

%# count, for each group, the numbers. 
groupCounts = hist(groupNumber,1:groupNumber(end));

%# if a group contains fewer than 17 entries, throw it out
badGroup = find(groupCounts < 17);
stdclock(ismember(groupNumber,badGroup)) = [];


stdclock =
    13    25    38    50    63    75    88   100   113   125   138   150   163   175   188   200   213
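A possible variation on the code above, not part of the original answer: it counts the group sizes with `accumarray` instead of `hist`, keeps the good values with a logical mask, and reshapes the result so that each surviving item is one column. The names `itemLength`, `keep`, `cleanClock`, and `items` are assumptions introduced here for illustration.

```matlab
stdclock = [13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 ...
            2529 2542 2554 2567 2579 2592 2604 2642 2654 2679 2692 2704];

itemLength = 17;   % values per complete item (from the question)
groupDelta = 15;   % gap threshold, same as above

% label each value with its (pseudo-)group number
groupNumber = cumsum([1, diff(stdclock) > groupDelta]);

% count the values in each group (accumarray stands in for hist here)
groupCounts = accumarray(groupNumber(:), 1).';

% keep only values whose group has exactly itemLength entries
keep = groupCounts(groupNumber) == itemLength;
cleanClock = stdclock(keep);

% one complete item per column
items = reshape(cleanClock, itemLength, []);
```

For the corrupted example, the group sizes come out as 17, 7, 2, and 3, so only the first item survives and `items` has a single column.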

Upvotes: 2
