Reputation: 67
Hi guys I have a json file with 1200 city names and their populations
I'm combining those cities by 3 using itertools.combination:
import json
from itertools import combinations as com
def main():
data = json.load(open('new.json'))
x = com(data, 3)
triples_dict = [i for i in x]
main()
This is just a test code, but it takes forever to run, I know it's a lot of combinations, and I don't need actually that many - later I will choose from overall results only 10000 combinations based on some linear probability - the bigger the sum of 3 cities' population the less possible this combination will be included into those 10000 combinations.
I am looking for a way to optimize the combination phase, I was thinking about skipping some combinations, for example, to skip every 4th combination, but I can't find a way to do that
Upvotes: 0
Views: 720
Reputation: 16076
You don't need itertools
at all. Just manually create some random combinations using random.sample
.
Theoretically this can return the same sample more than once, though that is quite unlikely. If this is a concern, you can use a set
to track tuples you've seen.
#!/usr/bin/env python3
import random
def load_cities():
# hard-code some data from Wikipedia instead of loading from JSON
# to simplify the demonstration
return [
("Tōkyō", 37400068),
("Delhi", 28514000),
("Shanghai", 25582000),
("São Paulo", 21650000),
("Mexico City", 21581000),
("Cairo", 20076000),
("Mumbai", 19980000),
("Beijing", 19618000),
("Dhaka", 19578000),
("Osaka", 19281000),
("New York", 18819000),
("Karachi", 15400000),
("Buenos Aires", 14967000),
("Chongqing", 14838000),
("Istanbul", 14751000),
("Kolkata", 14681000),
("Manila", 13482000),
("Lagos", 13463000),
("Rio de Janeiro", 13293000),
("Tianjin", 13215000),
("Kinshasa", 13171000),
("Guangzhou", 12638000),
("Los Angeles", 12458000),
("Moscow", 12410000),
("Shenzhen", 11908000),
("Lahore", 11738000),
("Bangalore", 11440000),
("Paris", 10901000),
("Bogotá", 10574000),
("Jakarta", 10517000),
("Chennai", 10456000),
("Lima", 10391000),
("Bangkok", 10156000),
("Seoul", 9963000),
("Nagoya", 9507000),
("Hyderabad", 9482000),
("London", 9046000),
("Tehran", 8896000),
("Chicago", 8864000),
("Chengdu", 8813000),
("Nanjing", 8245000),
("Wuhan", 8176000),
("Ho Chi Minh City", 8145000),
("Luanda", 7774000),
("Ahmedabad", 7681000),
("Kuala Lumpur", 7564000),
("Xi'an", 7444000),
("Hong Kong", 7429000),
("Dongguan", 7360000),
("Hangzhou", 7236000),
("Foshan", 7236000),
("Shenyang", 6921000),
("Riyadh", 6907000),
("Baghdad", 6812000),
("Santiago", 6680000),
("Surat", 6564000),
("Madrid", 6497000),
("Suzhou", 6339000),
("Pune", 6276000),
("Harbin", 6115000),
("Houston", 6115000),
("Dallas", 6099000),
("Toronto", 6082000),
("Dar es Salaam", 6048000),
("Miami", 6036000),
("Belo Horizonte", 5972000),
("Singapore", 5792000),
("Philadelphia", 5695000),
("Atlanta", 5572000),
("Fukuoka", 5551000),
("Khartoum", 5534000),
("Barcelona", 5494000),
("Johannesburg", 5486000),
("Saint Petersburg", 5383000),
("Qingdao", 5381000),
("Dalian", 5300000),
("Washington, D.C.", 5207000),
("Yangon", 5157000),
("Alexandria", 5086000),
("Jinan", 5052000),
("Guadalajara", 5023000),
]
def main():
all_cities = load_cities()
for _ in range(10):
some_cities = random.sample(all_cities, 3)
total_pop = 0
names = []
for city, pop in some_cities:
names.append(city)
total_pop += pop
names[-1] = 'and ' + names[-1]
names = ', '.join(names)
print('The total population of', names, 'is', total_pop)
if __name__ == '__main__':
main()
Upvotes: 2
Reputation: 19322
Try this with building a generator comprehension with the condition you need.
i%4<3
etc.from itertools import combinations
import numpy as np
ids = np.random.randint(0,500000,(12000,))
generator = (j for i,j in enumerate(combinations(ids, 3)) if i%4==0)
for i in range(10):
print(next(generator))
(105466, 481866, 68347)
(105466, 481866, 267588)
(105466, 481866, 298501)
(105466, 481866, 318543)
(105466, 481866, 479957)
(105466, 481866, 327264)
(105466, 481866, 437524)
(105466, 481866, 61672)
(105466, 481866, 390489)
(105466, 481866, 90756)
... the bigger the sum of 3 cities' population the less possible this combination will be included into those 10000 combinations.
This might be difficult without an analytical approach. There is no way you can estimate the population sum of every combination of the city without running the combinations.
What this means is, you may benefit from simply analyzing the distribution of population for a single list of cities.
import matplotlib.pyplot as plt
import numpy as np
from itertools import combinations
a = np.random.normal(size=(100,))
fig, axes = plt.subplots(1,2, figsize=(10,5))
axes[0].hist(a)
axes[1].hist([sum(i) for i in combinations(a,3)])
## NOTICE THE DISTRIBUTION OF THE ORIGINAL LIST AND COMBINATIONS
This will allow you to estimate the distribution of the sum of populations and set that condition in the generator object itself. You can estimate what would be a cutoff for your population sum over which you want to sample.
Another way is to take a large enough sample from the generator to estimate the population distribution from the sample distribution. Then use that as a condition to fetch enough samples for your downstream task.
Here is how the code would look like in general -
cap = 1000000
generator = (j for i,j in enumerate(combinations(ids, 3)) if i%4==0 and sum(j)>cap)
for i in range(10):
comb = next(generator)
print(comb,'->',sum(comb))
(105466, 481866, 479957) -> 1067289
(105466, 481866, 437524) -> 1024856
(105466, 481866, 481925) -> 1069257
(105466, 481866, 427610) -> 1014942
(105466, 481866, 490693) -> 1078025
(105466, 481866, 435187) -> 1022519
(105466, 481866, 499844) -> 1087176
(105466, 481866, 422053) -> 1009385
(105466, 481866, 436762) -> 1024094
(105466, 481866, 490919) -> 1078251
Upvotes: 1