How to split a set of strings into substrings in Python, making shorter substrings more likely?

Question

I have a set of strings, each several million characters long. I want to split them into substrings of random length, which I can already do without any particular issue.

However, my question is: how can I apply some sort of weight to the choice of substring length? My code runs in Python 3, so I would like to find a Pythonic solution. In detail, my aim is to:

- split the strings into substrings that range in length between 1e04 and 8e06 characters.
- make the script choose a short length (near 1e04) more often than a long length (near 8e06) for the newly generated substrings, like a descending length likelihood gradient.

Thanks for the help!

markuscosinus · Accepted Answer

There are probably many ways to do this. I would do it as follows:

Take a random number rand in the interval [0, 1) (random.random() never returns exactly 1):
```
import random
rand = random.random()
```
Apply an operation to that number that makes smaller values more likely while keeping the result in the range [0,1]. Which operation you use depends on what you want your likelihood distribution to look like. A simple choice would be the square.
```
rand = rand**2
```
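To see what squaring does to the distribution, here is a minimal sketch that compares how often a value falls below 0.25 before and after squaring. The sample size, the 0.25 threshold and the fixed seed are arbitrary choices for illustration, not part of the answer above. For a uniform number the squared value is below 0.25 whenever the original is below 0.5, so the second fraction should come out at about one half.

```python
import random

rng = random.Random(0)  # fixed seed only so the run is repeatable
n = 100_000

plain = [rng.random() for _ in range(n)]   # uniform on [0, 1)
squared = [x ** 2 for x in plain]          # skewed toward 0

# Fraction of samples that fall in the lowest quarter of the range.
frac_plain = sum(x < 0.25 for x in plain) / n
frac_squared = sum(x < 0.25 for x in squared) / n

print(f"uniform: {frac_plain:.3f}  squared: {frac_squared:.3f}")
```

A higher exponent (for example `rand**3`) would push even more of the mass toward the short end, if a steeper gradient is wanted.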
Scale the number space [0,1] up to [1e04, 8e06] and round to the nearest integer:
```
subStringLen = round(rand*(8e06-1e04)+1e04)
```
Get the substring of length subStringLen from your string and check how many characters are left.
- If there are more than 8e06 characters left, go back to the first step.
- If there are between 1e04 and 8e06 left, use them as your last substring.
- If there are fewer than 1e04 left, you need to decide whether to throw the rest away or to allow substrings shorter than 1e04 in this special case.
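Put together, the steps above could look roughly like the following sketch. The function names `weighted_length` and `split_weighted`, the `keep_short_tail` flag and the optional `rng` argument are names made up for this illustration; the squaring and the scaling are taken from the steps above.

```python
import random

MIN_LEN = 10_000      # 1e04
MAX_LEN = 8_000_000   # 8e06


def weighted_length(rng, min_len=MIN_LEN, max_len=MAX_LEN):
    """Draw a length in [min_len, max_len], biased toward min_len."""
    rand = rng.random() ** 2  # squaring makes small values more likely
    return round(rand * (max_len - min_len) + min_len)


def split_weighted(text, rng=None, min_len=MIN_LEN, max_len=MAX_LEN,
                   keep_short_tail=True):
    """Split text into consecutive substrings of weighted random length.

    keep_short_tail (hypothetical flag) decides what happens to a
    remainder shorter than min_len: keep it as a short piece or drop it.
    """
    if rng is None:
        rng = random.Random()

    pieces = []
    pos = 0
    # Keep drawing lengths while more than max_len characters are left.
    while len(text) - pos > max_len:
        n = weighted_length(rng, min_len, max_len)
        pieces.append(text[pos:pos + n])
        pos += n

    # Whatever is left is at most max_len characters long.
    tail = text[pos:]
    if len(tail) >= min_len or (tail and keep_short_tail):
        pieces.append(tail)
    return pieces
```

Calling `split_weighted(my_string)` returns a list of substrings in their original order; `my_string` stands for one of your long strings. Slicing copies the characters, so for strings of this size you may prefer to store only the `(start, end)` offsets instead.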

I'm sure there are many possible improvements in terms of efficiency; this is just to give you an idea of my method.
