Reputation: 774
I have a list of tweets (tokenized and preprocessed). It's like this:
['AT_TOKEN',
'what',
'AT_TOKEN',
'said',
'END',
'AT_TOKEN',
'plus',
'you',
've',
'added',
'commercials',
'to',
'the',
'experience',
'tacky',
'END',
'AT_TOKEN',
'i',
'did',
'nt',
'today',
'must',
'mean',
'i',
'need',
'to',
'take',
'another',
'trip',
'END']
END signifies that a tweet has ended and a new one has begun.
I want to find the bigram vocabulary for this list, but I'm having a hard time figuring out how to do it efficiently. I have figured out how to do this for unigrams like this:
unique_words = defaultdict(int)
for i in range(len(data)):
    unique_words[data[i]] = 1
return list(unique_words.keys())
The problem is that I need to first convert this list into bigrams and then find the vocabulary for those bigrams.
Can anybody help me figure this out?
Upvotes: 0
Views: 405
Reputation: 456
To complement furas' answer: you can use collections.Counter and itertools.pairwise (Python 3.10+) to count bigrams very efficiently:
from collections import Counter
from itertools import pairwise
# c = Counter(zip(data, data[1:])) on Python < 3.10
c = Counter(pairwise(data))
print(c)
Output:
Counter({('END', 'AT_TOKEN'): 2, ('AT_TOKEN', 'what'): 1, ('what', 'AT_TOKEN'): 1, ('AT_TOKEN', 'said'): 1, ('said', 'END'): 1, ...
Counter works just like a dictionary, but extends it with some useful methods. See https://docs.python.org/3/library/collections.html#collections.Counter
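If you only need the vocabulary rather than the counts, the unique bigrams are just the Counter's keys. A minimal sketch (using the zip fallback so it runs on any Python version, and a short stand-in for the full token list):

```python
from collections import Counter

# small stand-in for the tokenized tweet list
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END']

# zip(data, data[1:]) pairs each token with its successor
c = Counter(zip(data, data[1:]))

# the Counter's keys are exactly the bigram vocabulary
bigram_vocab = list(c.keys())
print(bigram_vocab)

# most_common() lists bigrams by descending frequency
print(c.most_common(2))
```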
Upvotes: 1
Reputation: 142631
For single words you would need only set() (without defaultdict):
unique_words = list(set(data))
print(unique_words)
For two words you can use a for-loop with data[i:i+2] and len(data)-1 (without defaultdict):
all_bigrams = []
for i in range(len(data)-1):
    all_bigrams.append( tuple(data[i:i+2]) )
unique_bigrams = list(set(all_bigrams))
print(unique_bigrams)
or using set() directly, without all_bigrams:
unique_bigrams = set()
for i in range(len(data)-1):
    unique_bigrams.add( tuple(data[i:i+2]) )
unique_bigrams = list(unique_bigrams)
print(unique_bigrams)
The same works for three words, but with data[i:i+3] and len(data)-2:
all_threewords = []
for i in range(len(data)-2):
    all_threewords.append( tuple(data[i:i+3]) )
unique_threewords = list(set(all_threewords))
print(unique_threewords)
or using set() directly, without all_threewords:
unique_threewords = set()
for i in range(len(data)-2):
    unique_threewords.add( tuple(data[i:i+3]) )
unique_threewords = list(unique_threewords)
print(unique_threewords)
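Since the two- and three-word versions differ only in the slice width and the loop bound, the same pattern generalizes to any n. A sketch with a hypothetical helper (the name ngrams is mine, not from the answer):

```python
def ngrams(tokens, n):
    # slide a window of width n over the token list
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

# small stand-in for the tokenized tweet list
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END']

unique_bigrams = list(set(ngrams(data, 2)))
unique_trigrams = list(set(ngrams(data, 3)))
print(unique_bigrams)
print(unique_trigrams)
```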
Full working example
data = ['AT_TOKEN',
'what',
'AT_TOKEN',
'said',
'END',
'AT_TOKEN',
'plus',
'you',
've',
'added',
'commercials',
'to',
'the',
'experience',
'tacky',
'END',
'AT_TOKEN',
'i',
'did',
'nt',
'today',
'must',
'mean',
'i',
'need',
'to',
'take',
'another',
'trip',
'END']
# ---
unique_words = list(set(data))
print(unique_words)
# ---
all_bigrams = []
for i in range(len(data)-1):
    all_bigrams.append( tuple(data[i:i+2]) )
unique_bigrams = list(set(all_bigrams))
print(unique_bigrams)
# ---
unique_bigrams = set()
for i in range(len(data)-1):
    unique_bigrams.add( tuple(data[i:i+2]) )
unique_bigrams = list(unique_bigrams)
print(unique_bigrams)
# ---
all_threewords = []
for i in range(len(data)-2):
    all_threewords.append( tuple(data[i:i+3]) )
unique_threewords = list(set(all_threewords))
print(unique_threewords)
# ---
unique_threewords = set()
for i in range(len(data)-2):
    unique_threewords.add( tuple(data[i:i+3]) )
unique_threewords = list(unique_threewords)
print(unique_threewords)
But I don't know if you need pairs like ('END', 'AT_TOKEN'), or any pair containing 'END' or 'AT_TOKEN'.
If not, you would first need to convert the data to sublists
data = [
['AT_TOKEN', 'what'],
['AT_TOKEN', 'said', 'END'],
['AT_TOKEN', 'plus', 'you', 've', 'added',
'commercials', 'to', 'the', 'experience',
'tacky', 'END'],
['AT_TOKEN', 'i', 'did', 'nt', 'today',
'must', 'mean', 'i', 'need', 'to', 'take',
'another', 'trip', 'END']
]
and later work with every sublist separately.
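A minimal sketch of that per-tweet approach, assuming the sublist structure above (only the first two tweets shown, to keep it short):

```python
# two of the per-tweet sublists from above
data = [
    ['AT_TOKEN', 'what'],
    ['AT_TOKEN', 'said', 'END'],
]

unique_bigrams = set()
for tweet in data:
    # looping per tweet means no bigram crosses a tweet boundary
    for i in range(len(tweet) - 1):
        unique_bigrams.add(tuple(tweet[i:i+2]))
print(list(unique_bigrams))
```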
Upvotes: 0