WJA
WJA

Reputation: 7004

Merge the content of two tables (looking for Matlab or Pseudo Code)

THIS QUESTION IS NOT ONLY FOR MATLAB USERS - If you know an answer to the problem in PSEUDOCODE, then feel free to leave your answer as well!


I have two tables Ta and Tb that have different number of rows and different number of columns. The content is all cell text, but maybe in the future it could also contain cell number.

I want to merge the content of these tables together under the following set of rules:

The tricky part however is that we do not have unique row keys, we only have unique column keys. Note above that I make a distinction between i* and i. The reason is that the row in Ta can be at a different index than Tb, same holds for the columns j* and j. The implications are:


Question

How can we merge the content of these two tables together in the most efficient way?


Here are some resources to explain the question in more detail:

1. Matlab example to play with:

Ta = cell2table({...
     'a1', 'b1', 'c1'; ...
     'a2', 'b2', 'c2'}, ...
      'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
     'b2*', 'c2', 'd2'; ...
     'b3', 'c3', 'd3'; ...
     'b4', 'c4', 'd4'}, ...
      'VariableNames', {'B','C', 'D'})

The resulting table Tc should be something like this:

Tc = cell2table({...
    'a1' 'b1' 'c1'   ''; ...
    'a2' 'b2' 'c2' 'd2'; ...
    ''   'b3' 'c3' 'd3'; ...
    ''   'b4' 'c4' 'd4'}, ...
     'VariableNames', {'A', 'B','C', 'D'})

2. A possible first step

I tried the following:

Tc = outerjoin(Ta, Tb, 'MergeKeys', true)

Which works smooth, but the problem is that it lacks the stacking of rows that seem similar. E.g. the above command produces:

 A        B       C       D  
____    _____    ____    ____
''      'b2*'    'c2'    'd2'
''      'b3'     'c3'    'd3'
''      'b4'     'c4'    'd4'
'a1'    'b1'     'c1'    ''  
'a2'    'b2'     'c2'    '' 

Here the rows

''      'b2*'    'c2'    'd2'
'a2'    'b2'     'c2'    '' 

Should have been merged into one:

'a2'    'b2'     'c2'    'd2' 

So we need one more step to stack those two together?


3. Example of a hurdle

If we have something like:

Ta = 
     A        B       C       
    ____    _____    ____
    'a1'    'b1'     'c1' 
    'a2'    'b2'     'c2'

Tb = 
     A        B       C       
    ____    _____    ____
    'a1'    'b2'     'c3' 

then the question arises whether the row in b should be merged with row 1 or row 2 of a or should all rows be merged or just put as a separate row? An idea on how to handle these type of situations would be nice as well.

Upvotes: 3

Views: 496

Answers (2)

Yuval Harpaz
Yuval Harpaz

Reputation: 1426

Here is a function that attempts to do the job. You feed in the two tables, a threshold by which you decide whether you merge two rows, and a logical to state whether you prefer to take values from the first table when merge conflict emerges. I did not prepare for extreme cases but see where it gets you with:

TkeepAll=mergeTables(Tb,Ta,1,true)
TmergeSome=mergeTables(Tb,Ta,0.25,true)
TmergeAll=mergeTables(Tb,Ta,-1,true)

here is the function:

function Tmerged=mergeTables(Ta,Tb,threshold,preferA)
%% parameters
% Ta and Tb are two the two tables to merge
% threshold=0.25; minimal ratio of identical values in rows for merge.
%   example: you have one row in table A with 3 values, but you only have two
%   values for the same columns in data B. if one of the values is identical
%   and one isn't, you have ratio of 1/2 aka 0.5, which passes a threshold of
%   0.25
% preferA=true; which to take when there is merge conflict
%% see how well rows fit to each other
% T1 is the table with fewer rows
if size(Ta,1)<=size(Tb,1)
    T1=Ta;
    T2=Tb;
    prefer1=preferA;
else
    T1=Tb;
    T2=Ta;
    prefer1=~preferA;
end
[commonVar1,commonVar2]=ismember(T1.Properties.VariableNames,...
    T2.Properties.VariableNames);
commonVar1=find(commonVar1);
commonVar2(commonVar2==0)=[];
% fit is a table with the size of N rows T1 by M rows T2, with values
% describing what ratio of identical items between each row in
% table 1 (shorter) and each row in table 2 (longer), among all not-missing
% points
for ii=1:size(T1,1) %rows of T1
    for jj=1:size(T2,1)
        fit(ii,jj)=sum(ismember(T1{ii,commonVar1},T2{jj,commonVar2}))/length(commonVar1);
    end
end
%% pair rows according to fit
% match has two columns, first one has T1 row number and secone one has the
% matching T2 row number
unpaired1=true(size(T1,1),1);
unpaired2=true(size(T2,1),1);
count=0;
match=[];
maxv=max(fit,[],2);
[~,order]=sort(maxv,'descend');
order=order';
for ii=order %1:size(T1,1)
    [maxv,maxi]=max(fit,[],2);
    if maxv(ii)>threshold
        count=count+1;
        match(count,1)=ii;
        match(count,2)=maxi(ii);
        unpaired1(ii)=false;
        unpaired2(match(count,2))=false;
        fit(:,match(count,2))=nan; %exclude paired row from next pairing
    end
end

%% prepare new variables
% first variables common to the two tables
Nrows=sum(unpaired1)+sum(unpaired2)+size(match,1);
namesCommon={};
namesCommon(1:length(commonVar1))={T1.Properties.VariableNames{commonVar1}};
for vari=1:length(commonVar1)
    if isempty(match)
        mergedData={};
    else
        if prefer1
            mergedData=T1{match(:,1),commonVar1(vari)}; %#ok<*NASGU>
        else
            mergedData=T2{match(:,2),commonVar2(vari)};
        end
    end
    data1=T1{unpaired1,commonVar1(vari)};
    data2=T2{unpaired2,commonVar2(vari)};
    eval([namesCommon{vari},'=[data1;mergedData;data2];']);
end
% variables only in 1
uncommonVar1=1:size(T1,2);
uncommonVar1(commonVar1)=[];
names1={};
names1(1:length(uncommonVar1))={T1.Properties.VariableNames{uncommonVar1}};
for vari=1:length(uncommonVar1)
    data1=T1{:,uncommonVar1(vari)};
    tmp=repmat({''},Nrows-size(data1,1),1);
    eval([names1{vari},'=[data1;tmp];']);
end
% variables only in 2
uncommonVar2=1:size(T2,2);
uncommonVar2(commonVar2)=[];
names2={};
names2(1:length(uncommonVar2))={T2.Properties.VariableNames{uncommonVar2}};
for vari=1:length(uncommonVar2)
    data2=T2{:,uncommonVar2(vari)};
    tmp=repmat({''},Nrows-size(data2,1),1);
    eval([names2{vari},'=[tmp;data2];']);
end
%% collect variables to a table
names=sort([namesCommon,names1,names2]);
str='table(';
for vari=1:length(names)
    str=[str,names{vari},','];
end
str=[str(1:end-1),');'];
Tmerged=eval(str);

Upvotes: 3

Dennis Jaheruddin
Dennis Jaheruddin

Reputation: 21563

Here is a conceptual answer, that could get you on the way:

  1. Define a 'scoring function' that tells you per row of Tb how good it matches a row in Ta.
  2. Fill Tc with Ta
  3. For each row in Ta, determine the best match with Tb. If the match quality is above your criterium, define the best match match to be a succesfull match.
  4. If a succesfull match was found, 'consume' it (use the info from Tb to enrich the corresponding row in Tc where needed)
  5. Keep going till you reach the end of Ta, whatever has not been consumed from Tb can now be 'appended' to Tc.

Room for improvement:

Note on choice of matches

Play around with consuming Ta instead of Tb, or use a more complex heuristic to determine the consumption order (e.g. calculate all 'distances' and optimize the matching based on a cost function).

Note that these improvements are only neccesary if you get a lot of false positives with your matches in the basic solution.

Note on definition of match quality

I would recommend you to start very simple with this, for instance if you have 4 fields, simply count how many fields match, or whether all nonempty fields match.

If you want to go further, consider evaluating how far the values are apart (e.g. mse) or how far the texts are apart (e.g. levensteihn distance).

Upvotes: 3

Related Questions