user1035491
user1035491

Reputation: 31

MATLAB: words matching between cell arrays of strings

I’m trying to solve the following problem, and I need to do it as efficiently as possible (i.e. trying to avoid loops as much as I can).

I have two cell arrays, namely A and B. Each cell of A and B contains a string of characters. The length of these strings of characters is variable. Let’s say:

A={‘life is wonderful’, ‘matlab makes your dreams come true’};

B={‘life would be meaningless without wonderful matlab’, ‘what a wonderful world’, ‘the shoemaker makes shoes’, ‘rock and roll baby’};

Moreover, the number of elements of cell array B is around three orders of magnitude larger than that of cell array A.

My goal is to find how many words of every char string in A also appear in every char string of B.

For the previous example, a suitable result could be something like:

match = [2 1 0 0
1 0 1 0]

The first row indicates how many words in the 1st char string of A appear in the four char strings of B. And the second row, the same for the 2nd char string of A.

The double loop implementation is straightforward, but very time consuming especially because of the length of cell array B (over 3 million cells).

Any ideas? Thank you very much.

Xavier

Upvotes: 3

Views: 2606

Answers (3)

Cavaz
Cavaz

Reputation: 3119

You can exploit Map, which offers you an efficient dictionary based structure:

For each word save the vector showing the occurrences in each string:

A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

mapA = containers.Map();
sizeA = size(A,2);
for i = 1:size(A,2)         % for each string
    a = regexpi(A(i),'\w+','match');
    for w = a{:}                % for each word extracted
        str = cell2mat(w);
        if(mapA.isKey(str))     % if word already indexed
            occ = mapA(str);
        else                    % new key
            occ = zeros(1,sizeA);
        end
        occ(i) = occ(i)+1;
        mapA(str) = occ;
    end
end

% same for B
mapB = containers.Map();
sizeB = size(B,2);
for i = 1:size(B,2) 
    a = regexpi(B(i),'\w+','match');
    for w = a{:}
        str = cell2mat(w);
        if(mapB.isKey(str))
            occ = mapB(str);
        else
            occ = zeros(1,sizeB);
        end
        occ(i) = occ(i)+1;
        mapB(str) = occ;
    end
end

then, for each unique word found in A, compute the matches with B

match = zeros(size(A,2),size(B,2));
for w = mapA.keys
    str = cell2mat(w);
    if (mapB.isKey(str))
        match = match + diag(mapA(str))*ones(size(match))*diag(mapB(str));
    end
end

Result:

match =

     2     1     0     0
     1     0     1     0

this way you have a complexity of #wordsA + #wordsB + #singleWordsA instead of #wordsA*#wordsB

EDIT: or, if you don't like Map, you can save the word-occurrence-vectors in alphabetically ordered vector. Then you can look for matches simultaneously checking both vectors:

(suppose we are using a struct where the w attribute is the word string and occ is the occurrence vector)

i = 1; j = 1;
while(i<=size(wordsA,2) && i<=size(wordsB,2))
if(strcmp(wordsA(i).w, wordsB(j).w))
    % update match
else
    if(before(wordsA(i).w, wordsA(i).w)) % before: fancy function returning 1 if the first argument comes (alphabetically) before the second one (no builtin function comes to my mind)
        i = i+1;
    else
        j = j+1;
    end
end

if you are looking for 'matlab' and you know in the 10th position is stored 'life' is useless to check the positions before, since the vector is alphabetically ordered. So we have #wordsA+#wordsB iteration vs. #wordsA*#wordsB of the nested loops solution.

Upvotes: 1

bdecaf
bdecaf

Reputation: 4732

The solution is not complex.

split the sentences apart with:

a_words = regexp(A,'(\w+)','match')
b_words = regexp(B,'(\w+)','match')

then compare in a loop:

match = nan(numel(a_words),numel(b_words));
for i = 1:numel(a_words)
    for j = 1:numel(b_words)
        match(i,j) = sum(ismember(a_words{i},b_words{j}));
    end
end

but to make it quicker - I'm not really sure. You definetly can put the inner loop in a parfor that should parallelize that. If it's really a lot of words maybe put them in a database. That will do the indexing for you.

Upvotes: 2

Amro
Amro

Reputation: 124563

Let me get this started by posting the "straightforward" solution (at least others have a baseline to compare against):

A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

count = zeros(numel(A),numel(B));

%# for each string
for i=1:numel(A)
    %# split into words
    str = textscan(A{i}, '%s', 'Delimiter',' '); str = str{1};

    %# for each word
    for j=1:numel(str)
        %# count occurences
        count(i,:) = count(i,:) + cellfun(@numel, strfind(B,str{j}));
    end
end

The result:

>> count
count =
     2     1     0     0
     1     0     1     0

A better algorithm could be building some kind of index or hash-table...

Upvotes: 2

Related Questions