kechap

Reputation: 2087

Counting same length items in a list

I am trying to port a CGI script to Python, using a Pythonic coding style.

sequence = "aaaabbababbbbabbabb"
res = sequence.split("a") + sequence.split("b")
res = [l for l in res if l]

The result is

>>> res
['bb', 'b', 'bbbb', 'bb', 'bb', 'aaaa', 'a', 'a', 'a', 'a']

This took about 100 lines of C. Now I want to efficiently count the items in the res list that have the same length. For example, here res contains 5 elements of length 1, 3 elements of length 2, and 2 elements of length 4.

The problem is that the sequence string can be very big.

Upvotes: 1

Views: 1362

Answers (2)

Sven Marnach

Reputation: 602285

The easiest way to generate a histogram of string lengths given a list of strings is to use collections.Counter:

>>> from collections import Counter
>>> a = ["a", "b", "aaa", "bb", "aa", "bbb", "", "a", "b"]
>>> Counter(map(len, a))
Counter({1: 4, 2: 2, 3: 2, 0: 1})

Edit: There is also a better way to find runs of equal characters, namely itertools.groupby():

>>> from collections import Counter
>>> from itertools import groupby
>>> sequence = "aaaabbababbbbabbabb"
>>> Counter(len(list(it)) for k, it in groupby(sequence))
Counter({1: 5, 2: 3, 4: 2})
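Since the question mentions that the sequence can be very large, here is a minimal sketch of the same groupby() idea that avoids building a temporary list for each run. The helper name run_length_histogram is my own and purely illustrative; it accepts a string or any iterable of characters.

```python
from collections import Counter
from itertools import groupby


def run_length_histogram(chars):
    """Count runs of equal consecutive items, keyed by run length.

    Hypothetical helper name, not part of the standard library.
    `chars` may be a string or any iterable of characters.
    """
    # sum(1 for _ in group) consumes each run lazily,
    # so no per-run list is materialized.
    return Counter(sum(1 for _ in group) for _, group in groupby(chars))


print(run_length_histogram("aaaabbababbbbabbabb"))
# Counter({1: 5, 2: 3, 4: 2})
```

Because groupby() only looks at consecutive items, this also works on a generator that yields characters one at a time, for example when reading the data in pieces.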

Upvotes: 7

Jack Edmonds

Reputation: 33181

You could probably do something like

sequence = "aaaabbababbbbabbabb"
occurrences_by_length = {}  # maps string length -> number of strings with that length
for piece in sequence.split("a") + sequence.split("b"):
    if not piece:
        continue  # split() yields empty strings between adjacent separators; skip them
    n = len(piece)
    if n in occurrences_by_length:
        occurrences_by_length[n] += 1
    else:
        occurrences_by_length[n] = 1

Now occurrences_by_length has a mapping of the length of each string to the number of times a string of that length appears.
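The same tally can probably be written a little more compactly with dict.get(), which supplies a default of 0 for lengths not seen yet. This is only a sketch: the function name count_by_length is hypothetical, and it assumes empty strings (which split() produces between adjacent separators) should be ignored, as in the question.

```python
def count_by_length(strings):
    """Hypothetical helper: map each non-zero string length to its frequency."""
    counts = {}
    for s in strings:
        if s:  # skip empty strings produced by split()
            counts[len(s)] = counts.get(len(s), 0) + 1
    return counts


sequence = "aaaabbababbbbabbabb"
print(count_by_length(sequence.split("a") + sequence.split("b")))
```

Note that both split() calls scan the whole string and create intermediate lists, so for very large inputs a single-pass approach is likely to use less memory.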

Upvotes: 1
