Reputation: 612
I have a set of strings which are some millions of characters each. I want to split them into substrings of random length, and this I can do with no particular issue.
However, my question is: how can I apply some sort of weight to the substring length choice? My code runs in python3
, so I would like to find a pythonic solution. In detail, my aim is to:
Thanks for the help!
Upvotes: 1
Views: 185
Reputation: 31
NumPy
supplies lots of random samping functions. Have a look through the various distributions available.
If you're looking for something that it weighted towards the lower end of the scale, maybe the exponential distribution would work?
With matplotlib
you can plot the histogram of the values, so you can get a better idea if the distribution fits what you want.
So something like this:
import numpy as np
import matplotlib.pyplot as plt
# desired range of values
mn = 1e04
mx = 8e06
# random values following exp distribution
values = np.random.exponential(scale=1, size=2000)
# scale the values to the desired range
values = ((mx-mn)*values/np.max(values)) + mn
# plot the distribution of values
plt.hist(values)
plt.grid()
plt.show()
plt.close()
Upvotes: 2
Reputation: 2267
There are probably many ways to do this. I would do it as follows:
rand
in the interval [0,1]
:
import random
rand = random.random()
[0,1]
. What operation you use depends on how you want your likelihood distribution to look like. A simple choice would be the square.
rand = rand**2
[0,1]
up to [1e04, 8e06]
and round to the next integer:
subStringLen = round(rand*(8e06-1e04)+1e04)
subStringLen
from your string and check how many characters are left.
8e06
characters left go to step 1.1e04
and 8e06
, use them as your last substring.1e04
you need to decide if you want to throw the rest away or allow substrings smaller than 1e04
in this speciel case.I'm sure there is a lot of improvements possible in terms of efficiency, this is just to give you an idea of my method.
Upvotes: 1