Eugene
Eugene

Reputation: 67

Is a way to skip some combinations in itertools?

Hi guys I have a json file with 1200 city names and their populations

I'm combining those cities by 3 using itertools.combination:

import json
from itertools import combinations as com

    def main():
        data = json.load(open('new.json'))
        x = com(data, 3)
        triples_dict = [i for i in x]
    main()

This is just a test code, but it takes forever to run, I know it's a lot of combinations, and I don't need actually that many - later I will choose from overall results only 10000 combinations based on some linear probability - the bigger the sum of 3 cities' population the less possible this combination will be included into those 10000 combinations.

I am looking for a way to optimize the combination phase, I was thinking about skipping some combinations, for example, to skip every 4th combination, but I can't find a way to do that

Upvotes: 0

Views: 720

Answers (2)

o11c
o11c

Reputation: 16076

You don't need itertools at all. Just manually create some random combinations using random.sample.

Theoretically this can return the same sample more than once, though that is quite unlikely. If this is a concern, you can use a set to track tuples you've seen.

#!/usr/bin/env python3

import random

def load_cities():
    # hard-code some data from Wikipedia instead of loading from JSON
    # to simplify the demonstration
    return [
        ("Tōkyō", 37400068),
        ("Delhi", 28514000),
        ("Shanghai", 25582000),
        ("São Paulo", 21650000),
        ("Mexico City", 21581000),
        ("Cairo", 20076000),
        ("Mumbai", 19980000),
        ("Beijing", 19618000),
        ("Dhaka", 19578000),
        ("Osaka", 19281000),
        ("New York", 18819000),
        ("Karachi", 15400000),
        ("Buenos Aires", 14967000),
        ("Chongqing", 14838000),
        ("Istanbul", 14751000),
        ("Kolkata", 14681000),
        ("Manila", 13482000),
        ("Lagos", 13463000),
        ("Rio de Janeiro", 13293000),
        ("Tianjin", 13215000),
        ("Kinshasa", 13171000),
        ("Guangzhou", 12638000),
        ("Los Angeles", 12458000),
        ("Moscow", 12410000),
        ("Shenzhen", 11908000),
        ("Lahore", 11738000),
        ("Bangalore", 11440000),
        ("Paris", 10901000),
        ("Bogotá", 10574000),
        ("Jakarta", 10517000),
        ("Chennai", 10456000),
        ("Lima", 10391000),
        ("Bangkok", 10156000),
        ("Seoul", 9963000),
        ("Nagoya", 9507000),
        ("Hyderabad", 9482000),
        ("London", 9046000),
        ("Tehran", 8896000),
        ("Chicago", 8864000),
        ("Chengdu", 8813000),
        ("Nanjing", 8245000),
        ("Wuhan", 8176000),
        ("Ho Chi Minh City", 8145000),
        ("Luanda", 7774000),
        ("Ahmedabad", 7681000),
        ("Kuala Lumpur", 7564000),
        ("Xi'an", 7444000),
        ("Hong Kong", 7429000),
        ("Dongguan", 7360000),
        ("Hangzhou", 7236000),
        ("Foshan", 7236000),
        ("Shenyang", 6921000),
        ("Riyadh", 6907000),
        ("Baghdad", 6812000),
        ("Santiago", 6680000),
        ("Surat", 6564000),
        ("Madrid", 6497000),
        ("Suzhou", 6339000),
        ("Pune", 6276000),
        ("Harbin", 6115000),
        ("Houston", 6115000),
        ("Dallas", 6099000),
        ("Toronto", 6082000),
        ("Dar es Salaam", 6048000),
        ("Miami", 6036000),
        ("Belo Horizonte", 5972000),
        ("Singapore", 5792000),
        ("Philadelphia", 5695000),
        ("Atlanta", 5572000),
        ("Fukuoka", 5551000),
        ("Khartoum", 5534000),
        ("Barcelona", 5494000),
        ("Johannesburg", 5486000),
        ("Saint Petersburg", 5383000),
        ("Qingdao", 5381000),
        ("Dalian", 5300000),
        ("Washington, D.C.", 5207000),
        ("Yangon", 5157000),
        ("Alexandria", 5086000),
        ("Jinan", 5052000),
        ("Guadalajara", 5023000),
    ]

def main():
    all_cities = load_cities()
    for _ in range(10):
        some_cities = random.sample(all_cities, 3)
        total_pop = 0
        names = []
        for city, pop in some_cities:
            names.append(city)
            total_pop += pop
        names[-1] = 'and ' + names[-1]
        names = ', '.join(names)
        print('The total population of', names, 'is', total_pop)

if __name__ == '__main__':
    main()

Upvotes: 2

Akshay Sehgal
Akshay Sehgal

Reputation: 19322

Try this with building a generator comprehension with the condition you need.

  1. Here I take random 12000 IDs
  2. Then I create a generator from the combinations of these ids and keep only the ones that have an index as a multiple of 4. (you can change that with more conditions such as i%4<3 etc.
  3. Finally, I loop 10 times (change to 10000 etc.) to get the first 10 combinations, without making it run infinitely.
from itertools import combinations
import numpy as np

ids = np.random.randint(0,500000,(12000,))

generator = (j for i,j in enumerate(combinations(ids, 3)) if i%4==0)

for i in range(10):
  print(next(generator))
(105466, 481866, 68347)
(105466, 481866, 267588)
(105466, 481866, 298501)
(105466, 481866, 318543)
(105466, 481866, 479957)
(105466, 481866, 327264)
(105466, 481866, 437524)
(105466, 481866, 61672)
(105466, 481866, 390489)
(105466, 481866, 90756)

... the bigger the sum of 3 cities' population the less possible this combination will be included into those 10000 combinations.

  1. This might be difficult without an analytical approach. There is no way you can estimate the population sum of every combination of the city without running the combinations.

  2. What this means is, you may benefit from simply analyzing the distribution of population for a single list of cities.

import matplotlib.pyplot as plt
import numpy as np
from itertools import combinations

a = np.random.normal(size=(100,))

fig, axes = plt.subplots(1,2, figsize=(10,5))

axes[0].hist(a)
axes[1].hist([sum(i) for i in combinations(a,3)])

## NOTICE THE DISTRIBUTION OF THE ORIGINAL LIST AND COMBINATIONS

enter image description here

  1. This will allow you to estimate the distribution of the sum of populations and set that condition in the generator object itself. You can estimate what would be a cutoff for your population sum over which you want to sample.

  2. Another way is to take a large enough sample from the generator to estimate the population distribution from the sample distribution. Then use that as a condition to fetch enough samples for your downstream task.

Here is how the code would look like in general -

cap = 1000000

generator = (j for i,j in enumerate(combinations(ids, 3)) if i%4==0 and sum(j)>cap)

for i in range(10):
  comb = next(generator)
  print(comb,'->',sum(comb))
(105466, 481866, 479957) -> 1067289
(105466, 481866, 437524) -> 1024856
(105466, 481866, 481925) -> 1069257
(105466, 481866, 427610) -> 1014942
(105466, 481866, 490693) -> 1078025
(105466, 481866, 435187) -> 1022519
(105466, 481866, 499844) -> 1087176
(105466, 481866, 422053) -> 1009385
(105466, 481866, 436762) -> 1024094
(105466, 481866, 490919) -> 1078251

Upvotes: 1

Related Questions