Ashar

Reputation: 774

Python: Find vocabulary of a bigram

I have a list of tweets (tokenized and preprocessed). It's like this:

['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

END signifies that a tweet has ended and a new one has begun.

I want to find the bigram vocabulary for this list, but I am having a hard time working out how to do it efficiently. I have figured out how to do this for unigrams:

from collections import defaultdict

def unigram_vocab(data):
    unique_words = defaultdict(int)
    for word in data:
        unique_words[word] = 1
    return list(unique_words.keys())

The problem is that I first need to convert this list into bigrams and then find the vocabulary of those bigrams.

Can anybody help me figure this out?

Upvotes: 0

Views: 405

Answers (2)

mwo

Reputation: 456

To complement furas' answer: you can use collections.Counter together with itertools.pairwise (available from Python 3.10) to count bigrams efficiently:

from collections import Counter
from itertools import pairwise  

# On Python < 3.10, use instead: c = Counter(zip(data, data[1:]))
c = Counter(pairwise(data))

print(c)

Output:

Counter({('END', 'AT_TOKEN'): 2, ('AT_TOKEN', 'what'): 1, ('what', 'AT_TOKEN'): 1, ('AT_TOKEN', 'said'): 1, ('said', 'END'): 1, ...

Counter works just like a dictionary, but extends it with some useful methods. See https://docs.python.org/3/library/collections.html#collections.Counter
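
Since the question asks for the vocabulary rather than the counts, the keys of the counter are already the distinct bigrams. A minimal sketch, assuming a shortened token list for illustration (the names bigram_counts and bigram_vocab are made up here) and using the zip form so it also runs on Python older than 3.10:

```python
from collections import Counter

# Shortened, illustrative token list in the same format as the question.
data = ['AT_TOKEN', 'what', 'END',
        'AT_TOKEN', 'said', 'END',
        'AT_TOKEN', 'plus', 'END']

# Count every pair of adjacent tokens.
bigram_counts = Counter(zip(data, data[1:]))

# Iterating over a Counter yields its keys, i.e. the distinct bigrams.
bigram_vocab = list(bigram_counts)

print(len(bigram_vocab))             # number of distinct bigrams
print(bigram_counts.most_common(1))  # the most frequent bigram with its count
```

If only the vocabulary is needed and the counts are never used, a plain set(zip(data, data[1:])) gives the same keys.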

Upvotes: 1

furas

Reputation: 142631

For single words you only need set() (no defaultdict required):

unique_words = list(set(data))

print(unique_words)

For pairs of words, you can use a for loop over range(len(data)-1) and take the slice data[i:i+2] (again without defaultdict):

all_bigrams = []

for i in range(len(data)-1):
    all_bigrams.append( tuple(data[i:i+2]) )
    
unique_bigrams = list(set(all_bigrams))

print(unique_bigrams)

or add directly to a set(), without all_bigrams:

unique_bigrams = set()

for i in range(len(data)-1):
    unique_bigrams.add( tuple(data[i:i+2]) )
    
unique_bigrams = list(unique_bigrams)

print(unique_bigrams)

The same works for three words, but with data[i:i+3] and len(data)-2:

all_threewords = []

for i in range(len(data)-2):
    all_threewords.append( tuple(data[i:i+3]) )
    
unique_threewords = list(set(all_threewords))

print(unique_threewords)

or add directly to a set(), without all_threewords:

unique_threewords = set()

for i in range(len(data)-2):
    unique_threewords.add( tuple(data[i:i+3]) )
    
unique_threewords = list(unique_threewords)

print(unique_threewords)
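
The two-word and three-word loops follow the same pattern, so they can be generalized to any n. A minimal sketch, where unique_ngrams is a hypothetical helper name (not part of any library) and the short data list is only for illustration:

```python
def unique_ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) found in tokens."""
    # Build n staggered slices: tokens[0:], tokens[1:], ..., tokens[n-1:].
    # zip() stops at the shortest slice, so only complete n-grams are produced.
    return set(zip(*(tokens[i:] for i in range(n))))


# Shortened, illustrative token list.
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END']

print(unique_ngrams(data, 2))  # distinct bigrams
print(unique_ngrams(data, 3))  # distinct three-word tuples
```

Because sets are unordered, the printed order can differ between runs; wrap the result in sorted() if a stable order is needed.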

Full working example


data = ['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

# ---

unique_words = list(set(data))

print(unique_words)

# ---

all_bigrams = []

for i in range(len(data)-1):
    all_bigrams.append( tuple(data[i:i+2]) )
    
unique_bigrams = list(set(all_bigrams))

print(unique_bigrams)

# ---

unique_bigrams = set()

for i in range(len(data)-1):
    unique_bigrams.add( tuple(data[i:i+2]) )
    
unique_bigrams = list(unique_bigrams)

print(unique_bigrams)

# ---

all_threewords = []

for i in range(len(data)-2):
    all_threewords.append( tuple(data[i:i+3]) )
    
unique_threewords = list(set(all_threewords))

print(unique_threewords)

# ---

unique_threewords = set()

for i in range(len(data)-2):
    unique_threewords.add( tuple(data[i:i+3]) )
    
unique_threewords = list(unique_threewords)

print(unique_threewords)

But I don't know whether you want pairs like ('END', 'AT_TOKEN'), or any pair containing 'END' or 'AT_TOKEN'.

If you don't, you would first need to convert the data into sublists, one per tweet:

data = [
    
  ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END'],

  ['AT_TOKEN', 'plus', 'you', 've', 'added',
   'commercials', 'to', 'the', 'experience',
   'tacky', 'END'],
  
  ['AT_TOKEN', 'i', 'did', 'nt', 'today',
   'must', 'mean', 'i', 'need', 'to', 'take',
   'another', 'trip', 'END']
  
]  

and then work with each sublist separately.
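
One way to do that split and then collect bigrams per tweet is sketched below. It assumes every tweet is terminated by 'END' as described in the question; split_tweets is a hypothetical helper name, and the short data list is only for illustration:

```python
def split_tweets(tokens, end_marker='END'):
    """Split a flat token list into one sublist per tweet (marker included)."""
    tweets, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == end_marker:
            tweets.append(current)
            current = []
    if current:  # keep trailing tokens that have no closing marker
        tweets.append(current)
    return tweets


# Shortened, illustrative token list.
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END',
        'AT_TOKEN', 'plus', 'END']

tweets = split_tweets(data)

# Collect bigrams inside each tweet only, so no pair spans two tweets.
unique_bigrams = set()
for tweet in tweets:
    unique_bigrams.update(zip(tweet, tweet[1:]))

print(tweets)
print(unique_bigrams)
```

With this approach ('END', 'AT_TOKEN') never appears, because 'END' is always the last token of its sublist. Pairs such as ('said', 'END') are still kept; drop the marker before building the pairs if those are unwanted too.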

Upvotes: 0
