What type of sparse vector should I use?

Question

Data

I have N different (sorted) vectors of indices (std::vector). The indices are in the range [0; L-1]. Here are two rules of thumbs about this data:

Only about 0.1% to 10% of possible index are present anywhere
If an index is found in a given vector, then it will likely be found multiple times again in other vectors.

Hence a possible data set with N=10 vectors and with L = 200 could be

{45, 110, 119, 145, 170}
{9, 45, 110, 145, 178, 170}
{45, 145}
{45, 178, 183}
{45, 53, 110, 170}
{9, 119, 123, 179}
{9, 45, 119, 130, 131, 170, 190, 199}
{9, 45, 110, 170, 199}
{31, 45, 145}
{9, 178, 183}

Goal

I would like to compute the frequencies of every index. I would do something like

std::vector computeFrequencies(std::vector>& data)
{
    assert(data.size() == N);

    std::vector frequencies(L);
    for (unsigned Ni = 0 ; Ni < N ; Ni++)
    {
        for (unsigned i = 0 ; i < data[Ni].size() ; i++)
        {
            assert(data[Ni][i] < L)
            frequencies[data[Ni][i]]++;
        }
    }

    for (unsigned i = 0 ; i < L; i++)
    {
        frequencies[i] /= (double) N;
    }

    return(frequencies);    
}

I will then loop again through the object returned by the function computeFrequencies only once.

for (unsigned i = 0 ; i < L; i++)
{
    foo(frequencies[i]);
}

Question

The object frequencies contains a lot fo zeros and I should hence be using a sparse vector instead. I don't have much understanding of sparse matrices though. What type of sparse vector should I use?

I am considering using boost::numeric::ublas::coordinate_matrix because as I loop through all N vectors, I would constantly be adding new non-zeros values and I think a coordinate matrix would be good for dealing with that. Note that generally speaking, for this function, I am more worried about RAM usage than about computational time.

Matt Timmermans · Accepted Answer

It doesn't look like a sparse vector representation is a good fit for your problem.

To accomplish your task as you describe it:

Merge your sorted vectors into a single sorted vector. How to do an efficient K-way merge pops up here every now and then: merging N sorted files using K way merge
Iterate through the new vector and count the number of duplicates of each entry (easy since they'll all be together) to get your frequencies and foo them as you go.

You can even do both steps at the same time, entirely avoiding the need to copy the data into a new structure.

What type of sparse vector should I use?

Answers (1)

Related Questions