Ed Doe
Ed Doe

Reputation: 45

Regrouping a list positionally into quantiles

I have a dict in which each key corresponds to a gene name, and each value corresponds to a list. The length of the list is different for each gene, because each element represents a different nucleotide. The number at each position indicates the "score" of the nucleotide.

Because each gene is a different length, I want to be able to directly compare their positional score distributions by splitting each gene up into quantiles (most likely, percentiles: 100 bins).

Here is some simulation data:

myData = {
'Gene1': [3, 1, 1, 2, 3, 1, 1, 1, 3, 0, 0, 0, 3, 3, 3, 0, 1, 2, 1, 3, 2, 2, 0, 2, 0, 1, 0, 3, 0, 3, 1, 1, 0, 3, 0, 0, 1, 0, 1, 0, 1, 3, 3, 2, 3, 1, 0, 1, 2, 2, 0, 3, 0, 2, 0, 1, 1, 2, 3, 3, 1, 2, 1, 3, 1, 0, 0, 3, 2, 0, 3, 0, 2, 1, 1, 1, 2, 1, 1, 3, 0, 1, 1, 1, 3, 3, 0, 2, 2, 1, 3, 2, 3, 0, 2, 3, 2, 1, 3, 1, 3, 2, 1, 3, 0, 3, 3, 0, 0, 1, 0, 3, 1, 1, 3, 0, 0, 2, 3, 1, 0, 2, 1, 2, 1, 2, 1, 2, 0, 1, 1, 1, 3, 1, 3, 1, 3, 2, 3, 3, 3, 1, 1, 2, 1, 0, 2, 2, 2, 0, 1, 0, 3, 1, 3, 2, 1, 3, 0, 1, 3, 1, 0, 1, 2, 1, 2, 2, 3, 2, 3, 2, 2, 2, 1, 2, 2, 0, 3, 1, 2, 1, 1, 3, 2, 2, 1, 3, 1, 0, 1, 3, 2, 2, 3, 0, 0, 1, 0, 0, 3],
'Gene2': [3, 0, 0, 0, 3, 3, 1, 3, 3, 1, 0, 0, 1, 0, 1, 1, 3, 2, 2, 2, 0, 1, 3, 2, 1, 3, 1, 1, 2, 3, 0, 2, 0, 2, 1, 3, 3, 3, 1, 2, 3, 2, 3, 1, 3, 0, 1, 1, 1, 1, 3, 2, 0, 3, 0, 1, 1, 2, 3, 0, 2, 1, 3, 3, 0, 3, 2, 1, 1, 2, 0, 0, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 0, 0, 1, 0, 3, 2, 3, 0, 2, 0, 2, 0, 2, 3, 0, 3, 0, 3, 2, 2, 0, 2, 3, 0, 2, 2, 3, 0, 3, 1, 2, 3, 0, 1, 0, 2, 3, 1, 3, 1, 2, 3, 1, 1, 0, 1, 3, 0, 2, 3, 3, 3, 3, 0, 1, 2, 2, 2, 3, 0, 3, 1, 0, 2, 3, 1, 0, 1, 1, 0, 3, 3, 1, 2, 1, 2, 3, 2, 3, 1, 2, 0, 2, 3, 1, 2, 3, 2, 1, 2, 2, 0, 0, 0, 0, 2, 0, 2, 3, 0, 2, 0, 0, 2, 0, 3, 3, 0, 1, 2, 3, 1, 3, 3, 1, 2, 1, 2, 1, 3, 2, 0, 2, 3, 0, 0, 0, 1, 1, 0, 1, 2, 0, 1, 2, 1, 3, 3, 0, 2, 2, 1, 0, 1, 1, 1, 0, 0, 2, 1, 2, 0, 1, 2, 1, 1, 3, 0, 1, 0, 1, 2, 1, 3, 0, 2, 3, 1, 2, 0, 0, 3, 2, 0, 3, 2, 1, 2, 3, 1, 0, 1, 0, 0, 1, 2, 3, 3, 2, 2, 1, 2, 2, 3, 3, 3, 3, 0, 0, 2, 2, 2, 2, 3, 2, 3, 2, 0, 3, 1, 0, 2, 3, 0, 1, 2, 2, 0, 2],
'Gene3': [2, 3, 1, 0, 3, 2, 1, 0, 1, 2, 1, 2, 1, 3, 0, 2, 2, 3, 2, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 3, 2, 2, 1, 3, 1, 2, 3, 0, 0, 3, 1, 0, 3, 2, 2, 3, 0, 0, 3, 3, 1, 1, 1, 0, 0, 2, 3, 2, 0, 2, 0, 1, 0, 2, 3, 0, 2, 0, 3, 3, 0, 0, 1, 0, 3, 2, 1, 1, 3, 3, 0, 2, 3, 1, 1, 0, 1, 3, 2, 1, 0, 3, 2, 0, 3, 2, 1, 1, 0, 3, 0, 0, 2, 0, 3, 3, 0, 2, 0, 3, 3, 2, 0, 0, 2, 2, 0, 2, 0, 0, 2, 3, 3, 3, 3, 1, 3, 0, 0, 3, 1, 0, 2, 2, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 0, 0, 3, 0, 2, 2, 0, 0, 3, 0, 1, 3, 1, 1, 0, 2, 2, 3, 3, 0, 2, 0, 0, 2, 3, 1, 2, 1, 1, 2, 2, 0, 0, 3, 2, 2, 2, 1, 2, 0, 3, 2, 2, 2, 2, 1, 0, 3, 2, 2, 1, 0, 0, 2, 2, 0, 3, 2, 0, 2, 2, 1, 1, 1, 2, 1, 2, 0, 1, 0, 3, 2, 0, 2, 3, 3, 0, 2, 2, 0, 1, 1, 3, 0, 0, 1, 2, 3, 1, 3, 2, 3, 3, 2, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 2, 1, 3, 1, 3, 1, 1, 0, 3, 0, 1, 1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 0, 2, 0, 0, 2, 2, 2, 3, 3, 0, 0, 3, 2, 1, 2, 1, 0, 3, 2, 3, 1, 1, 0, 1, 3, 2, 0, 3, 1, 3, 1, 2, 0, 0, 2, 3, 2, 2, 0, 3, 0, 2, 2, 2, 3, 3, 2, 1, 3, 3, 0, 2, 2, 2, 1, 1, 2, 1, 3, 2, 3, 2, 1, 3, 1, 0, 0, 2, 0, 1, 1, 3, 3, 0, 1, 2, 3, 1, 2, 3, 1, 1, 1, 2, 0, 2, 0, 1, 0, 3, 1, 0, 3, 3, 1, 3, 1, 1, 2, 2, 0, 2, 0, 1, 0, 3, 1, 1, 1, 3, 3, 0, 0, 1, 1, 2, 3, 0, 2, 0, 1, 1, 3, 3, 1, 1, 0, 0, 2, 0, 1, 2, 2, 2, 3, 1, 1, 1, 0, 3, 0, 0, 0, 1, 0, 1, 3, 1, 2, 2, 1, 2, 2]
}

As you can see, Gene1 has a length of 201, and Gene2 has a length of 301. However, Gene3 has a length of 428. I want to summarize each of these lists so that, for an arbitrary number of bins (nBins), I can partition the list into a list of lists.

For example, for the first two genes, if I chose nBins=100, then Gene1 would look like [[3,1],[1,2],[3,1],[1,1]...] while Gene2 would look like [[3,0,0],[0,3,3],[1,3,3]...]. That is, I want to partition based on the positions and not the values themselves. My dataset is large, so I'm looking for a library that can do this most efficiently.

Upvotes: 0

Views: 38

Answers (2)

Acccumulation
Acccumulation

Reputation: 3591

Are you sure the length of Gene1 isn't 201?

You don't say what you want to happen in the case where the length isn't divisible by the number of bins. My code mixes sublists of length floor(length/nBins) and ceiling(length/nBins) to get the right number of bins.

new_data = {key : [value[
           int(bin_number*len(value)/nBins):
           int((bin_number+1)*len(value)/nBins)
                ]
            for bin_number in range(nBins)] for key, value in myData.items()}

Upvotes: 1

MegaIng
MegaIng

Reputation: 7886

You don't need a library. Pure python should be fast enough in 90% of the cases:

nBins = 100

def group(l, size):
    return [l[i:i + size] for i in range(0, len(l) + len(l) % size, size)]


bin_data = {k: group(l, len(l) // nBins ) for k, l in myData.items()}
print(bin_data)

Upvotes: 1

Related Questions