Reputation: 25
I have two matrices A and B. Both have different sizes and 1st, 2nd, 3rd, & 4th value show year, month, day and values in both matrices. I need to extract rows with same year and month however, day of +/-6 days from matrix A and related rows form matrix B. If two or more days are close in matrices A & B, I should choose the rows corresponding to highest value from both matrices.
A = 1954 1 16 2,3042
1954 12 5 2,116
1954 12 21 1,9841
1954 12 22 2,7411
1955 1 13 1,8766
1955 10 16 1,4003
1955 12 29 1,4979
1956 1 19 2,1439
1956 1 21 1,7666
1956 11 26 1,7367
1956 11 27 1,8914
1957 1 27 1,151
1957 2 2 1,1484
1957 12 29 1,1906
1957 12 30 1,3157
1958 1 10 1,6186
1958 1 20 1,1637
1958 2 6 1,1639
1958 10 16 1,1444
1959 1 3 1,7784
1959 1 24 1,1871
1959 2 20 1,2264
1959 10 25 1,2194
1960 6 29 1,2327
1960 12 4 1,7213
1960 12 5 1,373
1961 3 21 1,7149
1961 3 27 1,4404
1961 11 3 1,3934
1961 12 5 1,777
1962 2 12 2,1813
1962 2 16 3,5776
1962 2 17 1,9236
1963 9 27 1,6164
1963 10 13 1,786
1963 10 14 1,9203
1963 11 22 1,7575
1964 2 2 1,4402
1964 11 15 1,437
1964 11 17 1,7588
1964 12 4 1,6358
1965 2 13 1,874
1965 11 2 2,6468
1965 11 26 1,7163
1965 12 11 1,8283
1966 12 1 2,1165
1966 12 19 1,6672
1966 12 24 1,8173
1966 12 25 1,4923
1967 2 23 2,3002
1967 3 1 1,9614
1967 3 18 1,673
1967 11 12 1,724
1968 1 4 1,6355
1968 1 15 1,6567
1968 3 6 1,1587
1968 3 18 1,212
1969 9 29 1,5613
1969 10 1 1,5016
1969 11 20 1,9304
1969 11 29 1,9279
1970 10 3 1,9859
1970 10 28 1,4065
1970 11 4 1,4227
1970 11 9 1,7901
B = 1954 12 28 774
1954 12 29 734
1955 3 26 712
1955 3 27 648
1956 7 18 1030
1956 7 23 1090
1957 2 17 549
1957 2 28 549
1958 2 27 759
1958 2 28 798
1959 1 10 421
1959 1 24 419
1960 12 5 762
1960 12 8 829
1961 2 12 788
1961 2 13 776
1962 2 15 628
1962 4 9 628
1963 3 12 552
1963 3 13 552
1964 2 12 260
1964 2 13 253
1965 12 22 862
1965 12 23 891
1966 1 5 828
1966 12 27 802
1967 1 1 777
1967 1 2 787
1968 1 17 981
1968 1 18 932
1969 3 15 511
1969 3 16 546
1970 2 25 1030
1970 2 26 1030
The expected output is a new matrix C:
C = 1954 12 22 2,7411 1954 12 28 774
1959 1 3 1,7784 1959 1 10 421
1959 1 24 1,1871 1959 1 24 419
1960 12 4 1,7213 1960 12 8 829
1962 2 12 2,1813 1962 2 15 628
1966 12 24 1,8173 1966 12 27 802
1968 1 15 1,6567 1968 1 17 981
Any help how to code this?
Upvotes: 0
Views: 72
Reputation: 165
I think the following should do what you want -
To deal with overlaps at year and month boundaries, it's useful to have the dates mapped to number of days since an epoch. The first function finds the earliest data in either dataset, and then formats it to be interpreted by the 'daysact' function.
function epoch_date_str = get_epoch_datestr(A,B)
Astr = int2str(A(:,1:3));
Bstr = int2str(B(:,1:3));
[epoch_Ay, epoch_Am, epoch_Ad] = earliest_date(A);
[epoch_By, epoch_Bm, epoch_Bd] = earliest_date(B);
[epoch_y, epoch_m, epoch_d] = earliest_date([epoch_Ay, epoch_Am, epoch_Ad; epoch_By, epoch_Bm, epoch_Bd]);
epoch_str = int2str([epoch_y, epoch_m, epoch_d]);
epoch_date_str = regexprep(epoch_str,'\s+','/')
end
This function then does the calculation of the number of days from the epoch to each date in the dataset, it's basically just wrangling data into a format accepted by the daysact
function.
function ndays = days_since_epoch(A, epoch_date_str)
ndays = zeros(size(A,1),1);
Astr = int2str(A(:,1:3));
for i=1:size(Astr,1)
ndays(i) = daysact(epoch_date_str, regexprep(Astr(i,:),'\s+','/'));
end
end
And now we can get on with the actual calculations - I was a bit confused by the fifth column in the 'A' matrix you presented, I assume that is the score, but if not it's configured by the A_MATRIX_SCORE_COL
variable. Similarly the 6 day window is configured by the WINDOW_SIZE
.
ep_str = get_epoch_datestr(A,B);
ndaysA = days_since_epoch(A, ep_str);
ndaysB = days_since_epoch(B, ep_str);
C = [];
WINDOW_SIZE= 6;
A_MATRIX_SCORE_COL = 5;
for i=1:length(B)
% Find dates within the date window
overlaps = find(ndaysA >= (ndaysB(i) - window_size ) & (ndaysA <= (ndaysB(i) + window_size )));
% If there are multiple matches, choose the highest and append to C
if (length(overlaps) > 0)
[~, max_idx] = max(A(overlaps,A_MATRIX_SCORE_COL));
match_row = overlaps(max_idx);
C = [C; A(match_row,:) B(i,:)];
end
end
C = unique(C,'rows');
The output I get differs from yours:
C =
1954 12 22 2 7411 1954 12 28 774
1959 1 24 1 1871 1959 1 24 419
1960 12 4 1 7213 1960 12 5 762
1960 12 4 1 7213 1960 12 8 829
1962 2 16 3 5776 1962 2 15 628
1966 12 24 1 8173 1966 12 27 802
1968 1 15 1 6567 1968 1 17 981
1968 1 15 1 6567 1968 1 18 932
But your second row has a difference of 7 days, so I wouldn't expect it to be found. It can be included by increasing the window_size to 7.
As you can see, it's possible for a row in A to be included twice in C if it matches more than one date in B. This could be easily filtered from C if you want:
D = []
for i = 1:size(C,1)
% Find matching dates from A. Due to the way C was built, there won't be duplicates from B.
dupes = find((C(:,1) == C(i,1) & C( :,2) == C(i,2) & C( :,3) == C(i,3)))
% If there's only one match (i.e. it matches itself), then add to D
if (length(dupes) == 1)
D = [D; C(i,:)]
else
% If there are duplicates, then compare the scores from B and only add the highest score to D.
best = true;
for j=1:length(dupes)
if C(i,end) < C(dupes(j),end)
best = false;
end
end
if (best == true)
D = [D; C(i,:)]
end
end
end
The matrix 'D' is then your de-duplicated output.
Upvotes: 0