Reputation: 7004
THIS QUESTION IS NOT ONLY FOR MATLAB USERS - If you know an answer to the problem in PSEUDOCODE, then feel free to leave your answer as well!
I have two tables Ta and Tb that have different number of rows and different number of columns. The content is all cell text, but maybe in the future it could also contain cell number.
I want to merge the content of these tables together under the following set of rules:
Ta(i,j)
if Tb(i*,j*)
is empty and vice versa.Ta(i,j)
(and optionally, check whether they are the same).The tricky part however is that we do not have unique row keys, we only have unique column keys. Note above that I make a distinction between i*
and i
. The reason is that the row in Ta can be at a different index than Tb, same holds for the columns j*
and j
. The implications are:
Question
How can we merge the content of these two tables together in the most efficient way?
Here are some resources to explain the question in more detail:
1. Matlab example to play with:
Ta = cell2table({...
'a1', 'b1', 'c1'; ...
'a2', 'b2', 'c2'}, ...
'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
'b2*', 'c2', 'd2'; ...
'b3', 'c3', 'd3'; ...
'b4', 'c4', 'd4'}, ...
'VariableNames', {'B','C', 'D'})
The resulting table Tc should be something like this:
Tc = cell2table({...
'a1' 'b1' 'c1' ''; ...
'a2' 'b2' 'c2' 'd2'; ...
'' 'b3' 'c3' 'd3'; ...
'' 'b4' 'c4' 'd4'}, ...
'VariableNames', {'A', 'B','C', 'D'})
2. A possible first step
I tried the following:
Tc = outerjoin(Ta, Tb, 'MergeKeys', true)
Which works smooth, but the problem is that it lacks the stacking of rows that seem similar. E.g. the above command produces:
A B C D
____ _____ ____ ____
'' 'b2*' 'c2' 'd2'
'' 'b3' 'c3' 'd3'
'' 'b4' 'c4' 'd4'
'a1' 'b1' 'c1' ''
'a2' 'b2' 'c2' ''
Here the rows
'' 'b2*' 'c2' 'd2'
'a2' 'b2' 'c2' ''
Should have been merged into one:
'a2' 'b2' 'c2' 'd2'
So we need one more step to stack those two together?
3. Example of a hurdle
If we have something like:
Ta =
A B C
____ _____ ____
'a1' 'b1' 'c1'
'a2' 'b2' 'c2'
Tb =
A B C
____ _____ ____
'a1' 'b2' 'c3'
then the question arises whether the row in b should be merged with row 1 or row 2 of a or should all rows be merged or just put as a separate row? An idea on how to handle these type of situations would be nice as well.
Upvotes: 3
Views: 496
Reputation: 1426
Here is a function that attempts to do the job. You feed in the two tables, a threshold by which you decide whether you merge two rows, and a logical to state whether you prefer to take values from the first table when merge conflict emerges. I did not prepare for extreme cases but see where it gets you with:
TkeepAll=mergeTables(Tb,Ta,1,true)
TmergeSome=mergeTables(Tb,Ta,0.25,true)
TmergeAll=mergeTables(Tb,Ta,-1,true)
here is the function:
function Tmerged=mergeTables(Ta,Tb,threshold,preferA)
%% parameters
% Ta and Tb are two the two tables to merge
% threshold=0.25; minimal ratio of identical values in rows for merge.
% example: you have one row in table A with 3 values, but you only have two
% values for the same columns in data B. if one of the values is identical
% and one isn't, you have ratio of 1/2 aka 0.5, which passes a threshold of
% 0.25
% preferA=true; which to take when there is merge conflict
%% see how well rows fit to each other
% T1 is the table with fewer rows
if size(Ta,1)<=size(Tb,1)
T1=Ta;
T2=Tb;
prefer1=preferA;
else
T1=Tb;
T2=Ta;
prefer1=~preferA;
end
[commonVar1,commonVar2]=ismember(T1.Properties.VariableNames,...
T2.Properties.VariableNames);
commonVar1=find(commonVar1);
commonVar2(commonVar2==0)=[];
% fit is a table with the size of N rows T1 by M rows T2, with values
% describing what ratio of identical items between each row in
% table 1 (shorter) and each row in table 2 (longer), among all not-missing
% points
for ii=1:size(T1,1) %rows of T1
for jj=1:size(T2,1)
fit(ii,jj)=sum(ismember(T1{ii,commonVar1},T2{jj,commonVar2}))/length(commonVar1);
end
end
%% pair rows according to fit
% match has two columns, first one has T1 row number and secone one has the
% matching T2 row number
unpaired1=true(size(T1,1),1);
unpaired2=true(size(T2,1),1);
count=0;
match=[];
maxv=max(fit,[],2);
[~,order]=sort(maxv,'descend');
order=order';
for ii=order %1:size(T1,1)
[maxv,maxi]=max(fit,[],2);
if maxv(ii)>threshold
count=count+1;
match(count,1)=ii;
match(count,2)=maxi(ii);
unpaired1(ii)=false;
unpaired2(match(count,2))=false;
fit(:,match(count,2))=nan; %exclude paired row from next pairing
end
end
%% prepare new variables
% first variables common to the two tables
Nrows=sum(unpaired1)+sum(unpaired2)+size(match,1);
namesCommon={};
namesCommon(1:length(commonVar1))={T1.Properties.VariableNames{commonVar1}};
for vari=1:length(commonVar1)
if isempty(match)
mergedData={};
else
if prefer1
mergedData=T1{match(:,1),commonVar1(vari)}; %#ok<*NASGU>
else
mergedData=T2{match(:,2),commonVar2(vari)};
end
end
data1=T1{unpaired1,commonVar1(vari)};
data2=T2{unpaired2,commonVar2(vari)};
eval([namesCommon{vari},'=[data1;mergedData;data2];']);
end
% variables only in 1
uncommonVar1=1:size(T1,2);
uncommonVar1(commonVar1)=[];
names1={};
names1(1:length(uncommonVar1))={T1.Properties.VariableNames{uncommonVar1}};
for vari=1:length(uncommonVar1)
data1=T1{:,uncommonVar1(vari)};
tmp=repmat({''},Nrows-size(data1,1),1);
eval([names1{vari},'=[data1;tmp];']);
end
% variables only in 2
uncommonVar2=1:size(T2,2);
uncommonVar2(commonVar2)=[];
names2={};
names2(1:length(uncommonVar2))={T2.Properties.VariableNames{uncommonVar2}};
for vari=1:length(uncommonVar2)
data2=T2{:,uncommonVar2(vari)};
tmp=repmat({''},Nrows-size(data2,1),1);
eval([names2{vari},'=[tmp;data2];']);
end
%% collect variables to a table
names=sort([namesCommon,names1,names2]);
str='table(';
for vari=1:length(names)
str=[str,names{vari},','];
end
str=[str(1:end-1),');'];
Tmerged=eval(str);
Upvotes: 3
Reputation: 21563
Here is a conceptual answer, that could get you on the way:
Room for improvement:
Play around with consuming Ta instead of Tb, or use a more complex heuristic to determine the consumption order (e.g. calculate all 'distances' and optimize the matching based on a cost function).
Note that these improvements are only neccesary if you get a lot of false positives with your matches in the basic solution.
I would recommend you to start very simple with this, for instance if you have 4 fields, simply count how many fields match, or whether all nonempty fields match.
If you want to go further, consider evaluating how far the values are apart (e.g. mse) or how far the texts are apart (e.g. levensteihn distance).
Upvotes: 3