Bouteloua89

Reputation: 31

random numpy array for DNA bases

I'm wondering how to get a random numpy array of integers that encode DNA bases. I have the basic numpy call working, but the only way I have found to get the output I want is to convert the numpy array into a list of strings and then back to integers, which I would like to avoid.

#A = 1
#T = 2
#G = 3
#C = 4

np.random.randint(1, 5, size=(5, 3))

array([[1, 2, 1],
   [2, 2, 3],
   [2, 4, 2],
   [4, 2, 1],
   [1, 3, 4]])

The desired output is a numpy array of integers, one per row:

array([[121],
   [223],
   [242],
   [421],
   [134]])

Thank you for any ideas

Upvotes: 0

Views: 229

Answers (3)

gabe

Reputation: 2511

Here's another answer with numpy.

Strategy: First precompute every possible 3-base code (there are only 4**3 = 64 of them, so it's cheap), and then sample from them with np.random.choice.

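import numpy as np
from itertools import product

nums = "1234"
# Build a real list: in Python 3, map() is lazy and np.random.choice needs a sequence
bases = [int("".join(p)) for p in product(nums, repeat=3)]
np.random.choice(bases, 10**8)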

Casting to an integer happens during the precompute step, so it won't be a bottleneck. This generates a hundred million base pairs in no time on a MacBook.

Note:

If you want to compute a lot of base pairs, this way is about 5 times faster (3 seconds vs. 17 seconds for 10**8 random bases) than the one-liner that first generates random digits and then takes the dot product. That strategy requires two passes over the data, whereas this one takes a single pass.
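If you want to check the timing claim on your own machine, here is a rough sketch. The sample size N, the function names via_choice and via_dot, and the use of timeit are my own choices for illustration, and the ratio you see will depend on your hardware and numpy version.

```python
import timeit
from itertools import product

import numpy as np

N = 10**6  # assumption: smaller than 10**8 so the comparison runs quickly

# Lookup table of all 64 three-digit codes built from the digits 1-4
codes = np.array([int("".join(p)) for p in product("1234", repeat=3)])

def via_choice():
    # Sample directly from the precomputed codes
    return np.random.choice(codes, N)

def via_dot():
    # Draw the digits first, then combine them with decimal weights
    return np.random.randint(1, 5, size=(N, 3)).dot([100, 10, 1])

for fn in (via_choice, via_dot):
    seconds = timeit.timeit(fn, number=3) / 3
    print(fn.__name__, seconds)
```

Both functions return a flat array of N integer codes, so they are interchangeable as far as the output goes.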

In general, if you want d bases per code and N samples, then this does the trick:

bases = [int("".join(p)) for p in product(nums, repeat=d)]
np.random.choice(bases, N)

If d is larger than 8 or 9, then bases will be sufficiently long that you probably would be better off going with the other version using the dot product. But if d is small -- then this is definitely faster.
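To see why the lookup table stops being attractive for larger d, note that it holds 4**d entries. A small sketch, where make_codes and sample_codes are hypothetical helper names of mine rather than anything from numpy:

```python
from itertools import product

import numpy as np

def make_codes(d, digits="1234"):
    """Return every d-digit code built from `digits` as an integer array."""
    return np.array([int("".join(p)) for p in product(digits, repeat=d)])

def sample_codes(d, n):
    """Draw n random d-digit codes from the precomputed table."""
    return np.random.choice(make_codes(d), n)

# The table length is 4**d, so it grows quickly with d
for d in (3, 6, 8):
    print(d, len(make_codes(d)))

print(sample_codes(3, 5))
```

Building the table is a one-off cost, so in real use you would probably compute it once and reuse it rather than rebuilding it on every call as sample_codes does here.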

Upvotes: 1

Falko

Reputation: 17867

Why not construct a 3-digit integer from the 3 separate integers you already have?

import numpy as np

r = np.random.randint(1, 5, size=(5, 3))

print((r[:, 0] * 100 + r[:, 1] * 10 + r[:, 2])[:, None])

Output:

[[444]
 [332]
 [213]
 [434]
 [341]]

Depending on the required output shape you might not need to do the reshape via [:, None]. But this version yields exactly the example output format.
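To make the shape difference concrete, here is a small sketch. The variable names flat and column are just mine for illustration.

```python
import numpy as np

r = np.random.randint(1, 5, size=(5, 3))

# Combine the three digit columns into one integer per row
flat = r[:, 0] * 100 + r[:, 1] * 10 + r[:, 2]

# Indexing with None adds a trailing axis, turning the result into a column
column = flat[:, None]

print(flat.shape)    # (5,)
print(column.shape)  # (5, 1)
```

If a flat array of codes is enough for you, you can drop the [:, None] step.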


One-liner:

A more compact version uses the dot product between the random matrix and a vector of decimal powers:

print(np.random.randint(1, 5, size=(5, 3)).dot([100, 10, 1])[:, None])

More flexible:

In general, you can generate the array depending on the number of rows n and columns d:

print(np.random.randint(1, 5, size=(n, d)).dot(10 ** np.arange(d)[::-1])[:, None])
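Wrapped up as a function, the general version might look like the sketch below. random_codes is a hypothetical name of mine, and the digit range 1-4 is taken from the encoding in the question. The fixed example at the end checks that the powers of ten are in the right order, so that the first column becomes the leading digit.

```python
import numpy as np

def random_codes(n, d, low=1, high=5):
    """Return (digits, codes): an n x d digit matrix and its n decimal codes."""
    digits = np.random.randint(low, high, size=(n, d))
    # Weights 10**(d-1), ..., 10, 1 so column 0 is the most significant digit
    powers = 10 ** np.arange(d - 1, -1, -1)
    return digits, digits.dot(powers)

digits, codes = random_codes(4, 3)
print(digits)
print(codes[:, None])

# Deterministic check of the digit order using rows from the question
fixed = np.array([[1, 2, 1], [1, 3, 4]])
print(fixed.dot(10 ** np.arange(2, -1, -1)))  # [121 134]
```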

Upvotes: 2

jakebrinkmann

Reputation: 805

I think you would be best off using the approach described in your question: int() --> str() --> int()

>>> thing = np.random.randint(1, 5, size=(5, 3))
>>> [int(''.join([str(x) for x in a])) for a in thing]
Out[47]: [414, 311, 221, 232, 131]

Or, for a numpy type answer:

>>> foo = lambda x: int(''.join([str(n) for n in x]))
>>> np.apply_along_axis(foo, 1, thing)
Out[7]: array([414, 311, 221, 232, 131])
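As a sanity check, the string round trip gives the same values as multiplying each column by its decimal weight. A minimal sketch, with via_str and via_dot as names I made up for the two results:

```python
import numpy as np

thing = np.random.randint(1, 5, size=(1000, 3))

# String round trip: join each row's digits, then parse the result as an int
via_str = np.array([int("".join(str(x) for x in row)) for row in thing])

# Arithmetic route: weight the columns by 100, 10 and 1
via_dot = thing.dot([100, 10, 1])

print(np.array_equal(via_str, via_dot))  # True
```

The string version loops in Python over every row, so it will likely be slower on large arrays, but for a handful of rows either is fine.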

Upvotes: 0
