Reputation: 31
I’m trying to solve the following problem, and I need to do it as efficiently as possible (i.e. trying to avoid loops as much as I can).
I have two cell arrays, namely A and B. Each cell of A and B contains a string of characters. The length of these strings of characters is variable. Let’s say:
A={‘life is wonderful’, ‘matlab makes your dreams come true’};
B={‘life would be meaningless without wonderful matlab’, ‘what a wonderful world’, ‘the shoemaker makes shoes’, ‘rock and roll baby’};
Moreover, the number of elements of cell array B is around three orders of magnitude larger than that of cell array A.
My goal is to find how many words of every char string in A also appear in every char string of B.
For the previous example, a suitable result could be something like:
match = [2 1 0 0
1 0 1 0]
The first row indicates how many words in the 1st char string of A appear in the four char strings of B. And the second row, the same for the 2nd char string of A.
The double loop implementation is straightforward, but very time consuming especially because of the length of cell array B (over 3 million cells).
Any ideas? Thank you very much.
Xavier
Upvotes: 3
Views: 2606
Reputation: 3119
You can exploit Map, which offers you an efficient dictionary based structure:
For each word save the vector showing the occurrences in each string:
A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};
mapA = containers.Map();
sizeA = size(A,2);
for i = 1:size(A,2) % for each string
a = regexpi(A(i),'\w+','match');
for w = a{:} % for each word extracted
str = cell2mat(w);
if(mapA.isKey(str)) % if word already indexed
occ = mapA(str);
else % new key
occ = zeros(1,sizeA);
end
occ(i) = occ(i)+1;
mapA(str) = occ;
end
end
% same for B
mapB = containers.Map();
sizeB = size(B,2);
for i = 1:size(B,2)
a = regexpi(B(i),'\w+','match');
for w = a{:}
str = cell2mat(w);
if(mapB.isKey(str))
occ = mapB(str);
else
occ = zeros(1,sizeB);
end
occ(i) = occ(i)+1;
mapB(str) = occ;
end
end
then, for each unique word found in A, compute the matches with B
match = zeros(size(A,2),size(B,2));
for w = mapA.keys
str = cell2mat(w);
if (mapB.isKey(str))
match = match + diag(mapA(str))*ones(size(match))*diag(mapB(str));
end
end
Result:
match =
2 1 0 0
1 0 1 0
this way you have a complexity of #wordsA + #wordsB + #singleWordsA instead of #wordsA*#wordsB
EDIT: or, if you don't like Map
, you can save the word-occurrence-vectors in alphabetically ordered vector. Then you can look for matches simultaneously checking both vectors:
(suppose we are using a struct where the w
attribute is the word string and occ
is the occurrence vector)
i = 1; j = 1;
while(i<=size(wordsA,2) && i<=size(wordsB,2))
if(strcmp(wordsA(i).w, wordsB(j).w))
% update match
else
if(before(wordsA(i).w, wordsA(i).w)) % before: fancy function returning 1 if the first argument comes (alphabetically) before the second one (no builtin function comes to my mind)
i = i+1;
else
j = j+1;
end
end
if you are looking for 'matlab' and you know in the 10th position is stored 'life' is useless to check the positions before, since the vector is alphabetically ordered. So we have #wordsA+#wordsB iteration vs. #wordsA*#wordsB of the nested loops solution.
Upvotes: 1
Reputation: 4732
The solution is not complex.
split the sentences apart with:
a_words = regexp(A,'(\w+)','match')
b_words = regexp(B,'(\w+)','match')
then compare in a loop:
match = nan(numel(a_words),numel(b_words));
for i = 1:numel(a_words)
for j = 1:numel(b_words)
match(i,j) = sum(ismember(a_words{i},b_words{j}));
end
end
but to make it quicker - I'm not really sure. You definetly can put the inner loop in a parfor that should parallelize that. If it's really a lot of words maybe put them in a database. That will do the indexing for you.
Upvotes: 2
Reputation: 124563
Let me get this started by posting the "straightforward" solution (at least others have a baseline to compare against):
A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};
count = zeros(numel(A),numel(B));
%# for each string
for i=1:numel(A)
%# split into words
str = textscan(A{i}, '%s', 'Delimiter',' '); str = str{1};
%# for each word
for j=1:numel(str)
%# count occurences
count(i,:) = count(i,:) + cellfun(@numel, strfind(B,str{j}));
end
end
The result:
>> count
count =
2 1 0 0
1 0 1 0
A better algorithm could be building some kind of index or hash-table...
Upvotes: 2