Reputation: 65
I code in Python, and I have a string in which I want to count the number of occurrences of each bigram. For example, given the string "test string",
I would like to iterate through it in overlapping substrings of size 2 and create a dictionary mapping each bigram to the number of times it occurs in the original string.
Thus, I would like to get an output of the form {te: 1, es: 1, st: 2, ...}.
Could you help me to get this started?
Best regards!
Upvotes: 1
Views: 589
Reputation: 2812
I think something like this is simple and easy to do, and there is no need to import any library.
First, we remove all whitespace from the string using split() and join().
Then we construct a list containing every substring of length 2, one starting at each position in the string.
Finally, we construct and print() the dictionary, which has the substrings as keys and their respective number of occurrences in the original string as values.
substr = []  # Initialize an empty list that will hold all substrings.
step = 2  # Length of each substring (2 for bigrams).
s = ''.join('test string'.split())  # Remove all whitespace from the string.
for i in range(len(s)):
    substr.append(s[i: i + step])
# Construct and print a dictionary that counts all occurrences of the substrings.
# The len(k) == step check drops the short substring produced at the end of the string.
occurrences = {k: substr.count(k) for k in substr if len(k) == step}
print(occurrences)
When run, it outputs a dictionary, as you requested:
{'te': 1, 'es': 1, 'st': 2, 'ts': 1, 'tr': 1, 'ri': 1, 'in': 1, 'ng': 1}
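The same idea can also be sketched as a reusable function that counts in a single pass, which avoids calling list.count() once per substring. This is only a sketch: the function name count_ngrams and its parameters n and skip_whitespace are made up for illustration and are not part of the answer above.

```python
def count_ngrams(text, n=2, skip_whitespace=True):
    """Count overlapping substrings of length n in text (hypothetical helper)."""
    if skip_whitespace:
        # Same whitespace handling as the answer above: split and re-join.
        text = ''.join(text.split())
    counts = {}
    # Stop early enough that every slice has exactly n characters.
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    return counts


print(count_ngrams('test string'))
```

Because the loop stops at len(text) - n + 1, no short substring is produced at the end, so the length filter from the dictionary comprehension is not needed here.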
Upvotes: 1
Reputation: 8582
As a side note, you're looking for bigrams. At a larger scale, there are robust implementations in various machine-learning/NLP toolkits.
As an ad-hoc solution, the problem can be decomposed into two parts:
1. Iterating over consecutive pairs of characters
2. Counting the occurrences of each pair
The solution for problem #1 is pairwise from the itertools recipes.
The solution for problem #2 is Counter.
Putting it all together:
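from collections import Counter
from itertools import tee

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# Keys are tuples of characters, e.g. ('s', 't'), not two-character strings.
Counter(pairwise('test string'))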
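Note that pairwise yields tuples of characters, so the Counter keys are tuples such as ('s', 't') rather than strings such as 'st'. If string keys are wanted, as in the question, one way is to join each pair before counting. The following is a minimal sketch; the names tuple_counts and string_counts are only illustrative.

```python
from collections import Counter
from itertools import tee


def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


# Keys are tuples of two characters.
tuple_counts = Counter(pairwise('test string'))

# Join each pair so that the keys are two-character strings instead.
string_counts = Counter(''.join(pair) for pair in pairwise('test string'))

print(tuple_counts[('s', 't')])
print(string_counts['st'])
```

On Python 3.10 and later, itertools.pairwise is available in the standard library, so the recipe does not need to be copied by hand there.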
Upvotes: 1
Reputation: 45562
Given
s = "test string"
do
from collections import Counter
Counter(map(''.join, zip(s, s[1:])))
or
from collections import Counter
Counter(s[i:i+2] for i in range(len(s)-1))
The result of either is
Counter({'st': 2, 'te': 1, 'es': 1, 't ': 1, ' s': 1, 'tr': 1, 'ri': 1, 'in': 1, 'ng': 1})
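Both variants keep the space, so bigrams such as 't ' and ' s' appear in the result. If bigrams should not span word boundaries, one possible approach is to count within each whitespace-separated word. This is a sketch under that assumption; the function name word_bigrams is hypothetical.

```python
from collections import Counter


def word_bigrams(text):
    """Count bigrams inside each whitespace-separated word (hypothetical helper)."""
    counts = Counter()
    for word in text.split():
        # Bigrams never cross a word boundary because each word is handled separately.
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


print(word_bigrams('test string'))
```

Unlike removing the whitespace first, this does not create the bigram 'ts' from the end of "test" and the start of "string".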
Upvotes: 3