Python: Find vocabulary of a bigram

Question

I have a list of tweets (tokenized and preprocessed). It's like this:

['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

END signifies that a tweet has ended and a new one has begun.

I want to find the bigram vocabulary for this list, but I'm having a hard time working out how to do it efficiently. I have figured out how to do this for unigrams:

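from collections import defaultdict

# The function name is assumed; the original snippet showed only the body.
def unigram_vocabulary(data):
    unique_words = defaultdict(int)
    for i in range(len(data)):
        unique_words[data[i]] = 1
    return list(unique_words.keys())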

The problem is that I first need to convert this list into bigrams and then find the vocabulary of those bigrams.

Can anybody help me figure this out?

furas · Accepted Answer

For single words, set() alone is enough (no defaultdict needed):

unique_words = list(set(data))

print(unique_words)
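Note that set() does not keep the original token order. If order matters, one possible alternative (a small sketch, not part of the original answer) is dict.fromkeys(), which keeps the first occurrence of each word. The tokens list and the ordered_unique_words name below are only illustrative.

```python
# Small sample of tokens in the same shape as the question's list.
tokens = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this deduplicates while preserving first-seen order.
ordered_unique_words = list(dict.fromkeys(tokens))

print(ordered_unique_words)
```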

For pairs of words, use a for-loop over data[i:i+2] with range(len(data)-1) (again without defaultdict):

all_bigrams = []

for i in range(len(data)-1):
    all_bigrams.append( tuple(data[i:i+2]) )
    
unique_bigrams = list(set(all_bigrams))

print(unique_bigrams)

or add directly to a set(), without the intermediate all_bigrams list:

unique_bigrams = set()

for i in range(len(data)-1):
    unique_bigrams.add( tuple(data[i:i+2]) )
    
unique_bigrams = list(unique_bigrams)

print(unique_bigrams)
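The index loop above can probably also be written with zip(), which pairs each token with its successor. This is only a compact sketch of the same idea; the short tokens list is made up for illustration.

```python
# Illustrative sample; replace with the real token list.
tokens = ['AT_TOKEN', 'i', 'did', 'nt', 'END']

# zip() stops at the shorter input, so pairing the list with itself
# shifted by one yields exactly len(tokens) - 1 adjacent pairs.
unique_bigrams = set(zip(tokens, tokens[1:]))

print(sorted(unique_bigrams))
```

The result is a set of 2-tuples, the same shape the loop version produces before it is converted to a list.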

The same works for three words, with data[i:i+3] and range(len(data)-2):

all_threewords = []

for i in range(len(data)-2):
    all_threewords.append( tuple(data[i:i+3]) )
    
unique_threewords = list(set(all_threewords))

print(unique_threewords)

or add directly to a set(), without all_threewords:

unique_threewords = set()

for i in range(len(data)-2):
    unique_threewords.add( tuple(data[i:i+3]) )
    
unique_threewords = list(unique_threewords)

print(unique_threewords)
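Since the two-word and three-word versions differ only in the slice width and the loop bound, they could be folded into one helper. The sketch below assumes a hypothetical function name, unique_ngrams, and a made-up sample list; it is one way to generalize the pattern, not the only one.

```python
def unique_ngrams(tokens, n):
    """Return the set of distinct n-token tuples found in tokens."""
    # There are len(tokens) - n + 1 windows of width n; range() is empty
    # when the list is shorter than n, so no special case is needed.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Illustrative sample; replace with the real token list.
tokens = ['AT_TOKEN', 'i', 'need', 'to', 'take', 'END']

bigrams = unique_ngrams(tokens, 2)
trigrams = unique_ngrams(tokens, 3)

print(sorted(bigrams))
print(sorted(trigrams))
```

With n=2 the loop bound is len(tokens)-1 and with n=3 it is len(tokens)-2, matching the hand-written loops.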

Full working example


data = ['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

# ---

unique_words = list(set(data))

print(unique_words)

# ---

all_bigrams = []

for i in range(len(data)-1):
    all_bigrams.append( tuple(data[i:i+2]) )
    
unique_bigrams = list(set(all_bigrams))

print(unique_bigrams)

# ---

unique_bigrams = set()

for i in range(len(data)-1):
    unique_bigrams.add( tuple(data[i:i+2]) )
    
unique_bigrams = list(unique_bigrams)

print(unique_bigrams)

# ---

all_threewords = []

for i in range(len(data)-2):
    all_threewords.append( tuple(data[i:i+3]) )
    
unique_threewords = list(set(all_threewords))

print(unique_threewords)

# ---

unique_threewords = set()

for i in range(len(data)-2):
    unique_threewords.add( tuple(data[i:i+3]) )
    
unique_threewords = list(unique_threewords)

print(unique_threewords)

But I don't know whether you want pairs like ('END', 'AT_TOKEN'), or any pair containing 'END' or 'AT_TOKEN'.

If you don't want pairs that span two tweets, you would first need to convert the data to sublists, one per tweet:

data = [

  # each sublist is one tweet, closed by 'END'
  ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END'],

  ['AT_TOKEN', 'plus', 'you', 've', 'added',
   'commercials', 'to', 'the', 'experience',
   'tacky', 'END'],

  ['AT_TOKEN', 'i', 'did', 'nt', 'today',
   'must', 'mean', 'i', 'need', 'to', 'take',
   'another', 'trip', 'END']

]

and then process each sublist separately.
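A rough sketch of that per-tweet approach, assuming 'END' closes each tweet as the question states. The helper names split_tweets and unique_bigrams_per_tweet are hypothetical, and the flat list is a shortened sample of the question's data.

```python
def split_tweets(tokens, end_marker='END'):
    """Split a flat token list into one sublist per tweet."""
    tweets = []
    current = []
    for token in tokens:
        current.append(token)
        if token == end_marker:
            tweets.append(current)
            current = []
    if current:  # trailing tokens without a closing marker
        tweets.append(current)
    return tweets


def unique_bigrams_per_tweet(tokens):
    """Collect bigrams inside each tweet only, never across a boundary."""
    bigrams = set()
    for tweet in split_tweets(tokens):
        bigrams.update(zip(tweet, tweet[1:]))
    return bigrams


# Shortened sample in the same shape as the question's data.
flat = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END',
        'AT_TOKEN', 'i', 'did', 'nt', 'END']

result = unique_bigrams_per_tweet(flat)
print(sorted(result))
```

Because each tweet is handled on its own, a cross-tweet pair such as ('END', 'AT_TOKEN') never appears. Pairs that contain 'END' or 'AT_TOKEN' within a tweet are still kept; filtering those out would be a separate decision.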
