rrr
rrr

Reputation: 371

Finding the most common element in a list of variable-length tuples

I am writing a code that returns the most common bus stop number in a bus route (from a csv file). The first entry of 'bus_routes.csv' of each row is the name of the route, while rest of the entries are the bus stop numbers.

def most_common_number(routes):
    routes = load_routes('bus_routes.csv')
    bus_stop_ID = [col[1:] for col in routes]
    return max(sum(set((bus_stop_ID, []), key = sum(bus_stop_ID, []).count)

When I run this code, it returns an error; TypeError: can only concatenate list (not "tuple") to list type(bus_stop_number) is a list, so I cannot find where a tuple exists in my last code.

Upvotes: 3

Views: 152

Answers (1)

NPE
NPE

Reputation: 500713

I cannot find where a tuple exists in my last code

Each element of bus_stop_ID is tuple.

To recap, you have a list of variable-length tuples and you'd like to find the most common element across that entire structure.

This sounds like a perfect use for collections.Counter:

>>> from collections import Counter
>>> bus_stop_ID = [('id1', 'id2', 'id3'), ('id1',), ('id1', 'id3', 'id5')]
>>> [(stop_id, count)] = sum(map(Counter, bus_stop_ID), Counter()).most_common(1)
>>> stop_id
'id1'
>>> count
3

Here:

  • map(Counter, bus_stop_ID) counts bus stop ids in every route separately;
  • sum(..., Counter()) aggregates the counts across routes;
  • .most_common(1) returns the most common bus stop id in the aggregated counts (as a single-element list containing a 2-tuple that consists of a bus stop id and its count).

An alternative is to fist concatenate all the routes (using sum()) and then do the counting:

>>> [(stop_id, count)] = Counter(sum(bus_stop_ID, ())).most_common(1)

This is shorter but less efficient since it needs to concatenate all the routes into a single (potentially giant) tuple.

A — somewhat fancy — way to sidestep this inefficiency is by using itertools.chain():

>>> Counter(itertools.chain(*bus_stop_ID)).most_common(1)
[('id1', 3)]

This simply iterates over all bus stop ids in the first route, then the second and so on, doing the counting along the way.

Out of curiosity, I benchmarked all three methods on an arbitrarily chosen list of 10,000 tuples and the results are quite interesting:

In [11]: %timeit sum(map(Counter, bus_stop_ID), Counter()).most_common(1)
1 loop, best of 3: 235 ms per loop

In [12]: %timeit Counter(sum(bus_stop_ID, ())).most_common(1)
1 loop, best of 3: 2.64 s per loop

In [13]: %timeit Counter(itertools.chain(*bus_stop_ID)).most_common(1)
10 loops, best of 3: 17.3 ms per loop

Upvotes: 4

Related Questions