Finding and counting strings using multiple index vectors

Question

I have a character array, list (it could also be stored as a cell array if that is more useful), and wish to tally the number of occurrences of each letter against two different indexes held in two separate variables, type and ind.

list =
C C N N C U C N N N C N U N C N C

ind =
1 1 2 2 2 3 3 3 4 1 1 2 3 3 3 4 4 

type = 
15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16

The character array contains no spaces; they are added here for clarity.

Using the above example, the desired output would tally all instances of each unique letter in list, for each ind and for each type, giving three columns (for C/N/U), each with 4 rows (one per ind), per type. Entries are matched by position: the k-th letter in list belongs to the k-th entry of ind and the k-th entry of type.

Desired output of above example (the labels are added for clarity only):

            Type 15              Type 16
   Ind  C      N      U      C      N      U
    1   2      0      0      1      1      0
    2   1      2      0      0      1      0
    3   1      1      1      1      1      1
    4   0      1      0      1      1      0

I am only aware of how to do this with a single index (using unique, full and sparse).

How can I best go about doing this with a dual index?

Robert Seifert · Accepted Answer

One possibility is to convert your letters to doubles by subtracting 64, which maps the letter C to the number 3 (N becomes 14 and U becomes 21).
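As a quick check of that mapping, here is a minimal sketch. It assumes the letters are uppercase ASCII, so that 'A' has character code 65; the variable names codes and back are only illustrative.

```matlab
% Letters are stored as character codes ('A' is 65), so subtracting 64
% gives each letter's position in the alphabet.
list  = 'CCNNCUCNNNCNUNCNC';
codes = list - 64;          % char minus double yields a double row vector
                            % 'C' -> 3, 'N' -> 14, 'U' -> 21

% Adding 64 and converting with char reverses the mapping.
back  = char(codes + 64);   % same text as list
```

The reverse step is handy later for turning the numeric letter column of a result back into readable letters.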

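Then you can use unique with 'rows' and 'stable' to find each distinct (type, ind, letter) combination, and accumarray to count it. The columns of the result are type, ind, letter code and count: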

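list = 'CCNNCUCNNNCNUNCNC';
ind = [1 1 2 2 2 3 3 3 4 1 1 2 3 3 3 4 4];
type = [15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16];

%// one row per element: type, ind, letter code (C = 3, N = 14, U = 21)
data = [type(:) ind(:) (list(:) - 64)];
%// a: distinct rows in order of first appearance
%// c: for each element, the row of a it belongs to
[a,~,c] = unique(data,'rows','stable');
%// number of elements falling on each distinct row
occ = accumarray(c,1);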

output = [a, occ]

output =

    15     1     3     2
    15     2    14     2
    15     2     3     1
    15     3    21     1
    15     3     3     1
    15     3    14     1
    15     4    14     1
    16     1    14     1
    16     1     3     1
    16     2    14     1
    16     3    21     1
    16     3    14     1
    16     3     3     1
    16     4    14     1
    16     4     3     1
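If the wide layout from the question is wanted without any toolbox, one possible sketch is to hand accumarray three subscripts at once (ind, letter, type) and then flatten the resulting 3-D array. The names ii, li, ti, counts and wide are illustrative, and the columns come out in sorted order (C, N, U within each type) because unique is used without 'stable' here.

```matlab
list = 'CCNNCUCNNNCNUNCNC';
ind  = [1 1 2 2 2 3 3 3 4 1 1 2 3 3 3 4 4];
type = [15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16];

% Map each vector to consecutive integer labels 1..k (sorted order).
[~, ~, ii] = unique(ind(:));    % rows:    ind 1..4
[~, ~, li] = unique(list(:));   % columns: C, N, U
[~, ~, ti] = unique(type(:));   % pages:   type 15, 16

% Count elements per (ind, letter, type) cell: a 4-by-3-by-2 array.
counts = accumarray([ii li ti], 1);

% Lay the type pages side by side: 4 rows, C N U for 15, then C N U for 16.
wide = reshape(counts, size(counts, 1), []);
% wide =
%      2     0     0     1     1     0
%      1     2     0     0     1     0
%      1     1     1     1     1     1
%      0     1     0     1     1     0
```

Combinations that never occur (for example U with ind 1) are filled with zeros, which the long listing above leaves out.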

If you have the Statistics Toolbox you should consider using grpstats.

If you don't mind a mind-twisting output, then crosstab is by far the easiest solution:

output = crosstab(type(:),ind(:),list(:)-64)

%// type runs down the rows, ind across the columns
output(:,:,1) =   %// 'C'

     2     1     1     0
     1     0     1     1


output(:,:,2) =   %// 'N'

     0     2     1     1
     1     1     1     1


output(:,:,3) =  %// 'U'

     0     0     1     0
     0     0     1     0

The following one-liner comes close to your desired output:

output2 = reshape(crosstab(ind(:),list(:)-64,type(:)),4,[],1)

output2 =

     2     0     0     1     1     0
     1     2     0     0     1     0
     1     1     1     1     1     1
     0     1     0     1     1     0
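The reshape works because crosstab returns a 4-by-3-by-2 array here (ind by letter by type), and reshape places the two type pages side by side. To attach the labels from the question, one option is a table. This is only a sketch: it assumes array2table is available (R2013b or later), the column names are made up for illustration, and the counts are typed in from the output above so that the snippet runs without the toolbox.

```matlab
% Counts copied from output2 above (rows correspond to ind 1..4).
output2 = [2 0 0 1 1 0
           1 2 0 0 1 0
           1 1 1 1 1 1
           0 1 0 1 1 0];

% Hypothetical column names: letter, underscore, type.
names = {'C_15', 'N_15', 'U_15', 'C_16', 'N_16', 'U_16'};

% Prepend the ind values as the first column and label everything.
T = array2table([(1:4)' output2], 'VariableNames', [{'Ind'} names]);
```

Individual columns can then be read by name, for example T.N_15 for the N counts of type 15.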

The same toolbox also provides the tabulate function, which offers another option in combination with accumarray:

[~,~,c] = unique([type(:) ind(:)],'rows','stable')
output = accumarray(c(:),list(:),[],@(x) {tabulate(x)} )

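This also allows the following output, where the last two columns are the count and the percentage of each letter within its (type, ind) group: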

d = unique([type(:) ind(:) list(:)-64],'rows','stable')
output2 = [num2cell(d(:,[1,2])) vertcat(output{:})]

output2 = 

    [15]    [1]    'C'    [2]    [    100]
    [15]    [2]    'N'    [2]    [66.6667]
    [15]    [2]    'C'    [1]    [33.3333]
    [15]    [3]    'U'    [1]    [33.3333]
    [15]    [3]    'C'    [1]    [33.3333]
    [15]    [3]    'N'    [1]    [33.3333]
    [15]    [4]    'N'    [1]    [    100]
    [16]    [1]    'N'    [1]    [     50]
    [16]    [1]    'C'    [1]    [     50]
    [16]    [2]    'N'    [1]    [    100]
    [16]    [3]    'U'    [1]    [33.3333]
    [16]    [3]    'N'    [1]    [33.3333]
    [16]    [3]    'C'    [1]    [33.3333]
    [16]    [4]    'N'    [1]    [     50]
    [16]    [4]    'C'    [1]    [     50]
