Poulomi
Poulomi

Reputation: 25

Extract rows of matrices with nearest days record: MATLAB

I have two matrices A and B. Both have different sizes and 1st, 2nd, 3rd, & 4th value show year, month, day and values in both matrices. I need to extract rows with same year and month however, day of +/-6 days from matrix A and related rows form matrix B. If two or more days are close in matrices A & B, I should choose the rows corresponding to highest value from both matrices.

A = 1954    1   16  2,3042
1954    12  5   2,116
1954    12  21  1,9841
1954    12  22  2,7411
1955    1   13  1,8766
1955    10  16  1,4003
1955    12  29  1,4979
1956    1   19  2,1439
1956    1   21  1,7666
1956    11  26  1,7367
1956    11  27  1,8914
1957    1   27  1,151
1957    2   2   1,1484
1957    12  29  1,1906
1957    12  30  1,3157
1958    1   10  1,6186
1958    1   20  1,1637
1958    2   6   1,1639
1958    10  16  1,1444
1959    1   3   1,7784
1959    1   24  1,1871
1959    2   20  1,2264
1959    10  25  1,2194
1960    6   29  1,2327
1960    12  4   1,7213
1960    12  5   1,373
1961    3   21  1,7149
1961    3   27  1,4404
1961    11  3   1,3934
1961    12  5   1,777
1962    2   12  2,1813
1962    2   16  3,5776
1962    2   17  1,9236
1963    9   27  1,6164
1963    10  13  1,786
1963    10  14  1,9203
1963    11  22  1,7575
1964    2   2   1,4402
1964    11  15  1,437
1964    11  17  1,7588
1964    12  4   1,6358
1965    2   13  1,874
1965    11  2   2,6468
1965    11  26  1,7163
1965    12  11  1,8283
1966    12  1   2,1165
1966    12  19  1,6672
1966    12  24  1,8173
1966    12  25  1,4923
1967    2   23  2,3002
1967    3   1   1,9614
1967    3   18  1,673
1967    11  12  1,724
1968    1   4   1,6355
1968    1   15  1,6567
1968    3   6   1,1587
1968    3   18  1,212
1969    9   29  1,5613
1969    10  1   1,5016
1969    11  20  1,9304
1969    11  29  1,9279
1970    10  3   1,9859
1970    10  28  1,4065
1970    11  4   1,4227
1970    11  9   1,7901

B = 1954    12  28  774
1954    12  29  734
1955    3   26  712
1955    3   27  648
1956    7   18  1030
1956    7   23  1090
1957    2   17  549
1957    2   28  549
1958    2   27  759
1958    2   28  798
1959    1   10  421
1959    1   24  419
1960    12  5   762
1960    12  8   829
1961    2   12  788
1961    2   13  776
1962    2   15  628
1962    4   9   628
1963    3   12  552
1963    3   13  552
1964    2   12  260
1964    2   13  253
1965    12  22  862
1965    12  23  891
1966    1   5   828
1966    12  27  802
1967    1   1   777
1967    1   2   787
1968    1   17  981
1968    1   18  932
1969    3   15  511
1969    3   16  546
1970    2   25  1030
1970    2   26  1030

The expected output is a new matrix C:

C = 1954    12  22  2,7411  1954    12  28  774
1959    1   3   1,7784  1959    1   10  421
1959    1   24  1,1871  1959    1   24  419
1960    12  4   1,7213  1960    12  8   829
1962    2   12  2,1813  1962    2   15  628
1966    12  24  1,8173  1966    12  27  802
1968    1   15  1,6567  1968    1   17  981

Any help how to code this?

Upvotes: 0

Views: 72

Answers (1)

DoMakeSayThink
DoMakeSayThink

Reputation: 165

I think the following should do what you want -

To deal with overlaps at year and month boundaries, it's useful to have the dates mapped to number of days since an epoch. The first function finds the earliest data in either dataset, and then formats it to be interpreted by the 'daysact' function.

function epoch_date_str = get_epoch_datestr(A,B)
    Astr = int2str(A(:,1:3));
    Bstr = int2str(B(:,1:3));
    [epoch_Ay, epoch_Am, epoch_Ad]  = earliest_date(A);
    [epoch_By, epoch_Bm, epoch_Bd]  = earliest_date(B);
    [epoch_y, epoch_m, epoch_d] = earliest_date([epoch_Ay, epoch_Am, epoch_Ad; epoch_By, epoch_Bm, epoch_Bd]);
    epoch_str = int2str([epoch_y, epoch_m, epoch_d]);
    epoch_date_str = regexprep(epoch_str,'\s+','/')
end

This function then does the calculation of the number of days from the epoch to each date in the dataset, it's basically just wrangling data into a format accepted by the daysact function.

function ndays = days_since_epoch(A, epoch_date_str)
    ndays = zeros(size(A,1),1);
    Astr = int2str(A(:,1:3));
    for i=1:size(Astr,1)
        ndays(i) = daysact(epoch_date_str, regexprep(Astr(i,:),'\s+','/'));
    end
end

And now we can get on with the actual calculations - I was a bit confused by the fifth column in the 'A' matrix you presented, I assume that is the score, but if not it's configured by the A_MATRIX_SCORE_COL variable. Similarly the 6 day window is configured by the WINDOW_SIZE.

ep_str = get_epoch_datestr(A,B);
ndaysA = days_since_epoch(A, ep_str);
ndaysB = days_since_epoch(B, ep_str);
C = [];

WINDOW_SIZE= 6;
A_MATRIX_SCORE_COL = 5;
for i=1:length(B)
    % Find dates within the date window
    overlaps = find(ndaysA >= (ndaysB(i) - window_size ) & (ndaysA <= (ndaysB(i) + window_size )));
    % If there are multiple matches, choose the highest and append to C 
    if (length(overlaps) > 0)
      [~, max_idx] = max(A(overlaps,A_MATRIX_SCORE_COL));
      match_row = overlaps(max_idx);
      C = [C; A(match_row,:) B(i,:)];
    end
end
C = unique(C,'rows');

The output I get differs from yours:

C =

   1954     12     22      2   7411   1954     12     28    774
   1959      1     24      1   1871   1959      1     24    419
   1960     12      4      1   7213   1960     12      5    762
   1960     12      4      1   7213   1960     12      8    829
   1962      2     16      3   5776   1962      2     15    628
   1966     12     24      1   8173   1966     12     27    802
   1968      1     15      1   6567   1968      1     17    981
   1968      1     15      1   6567   1968      1     18    932

But your second row has a difference of 7 days, so I wouldn't expect it to be found. It can be included by increasing the window_size to 7.

As you can see, it's possible for a row in A to be included twice in C if it matches more than one date in B. This could be easily filtered from C if you want:

D = []
for i = 1:size(C,1)
    % Find matching dates from A. Due to the way C was built, there won't be duplicates from B.
    dupes = find((C(:,1) == C(i,1) & C( :,2) == C(i,2) & C( :,3) == C(i,3)))
    % If there's only one match (i.e. it matches itself), then add to D
    if (length(dupes) == 1)
        D = [D; C(i,:)]
    else
    % If there are duplicates, then compare the scores from B and only add the highest score to D.
        best = true;
        for j=1:length(dupes)
            if C(i,end) < C(dupes(j),end)
                best = false;
            end
        end
        if (best == true)
            D = [D; C(i,:)]
        end
    end
end

The matrix 'D' is then your de-duplicated output.

Upvotes: 0

Related Questions