user11096730

Finding names in a list of bigrams?

I have a text file I'm processing and I want to tokenize each word, but keep names together e.g. 'John Smith'.

I want to use nltk.bigrams to do this. Once I have a list of bigrams, how would I search it for bigrams where both words start with a capital letter?

bigrams = list(nltk.bigrams(text))

Upvotes: 0

Views: 212

Answers (3)

Artemiev

Reputation: 29

list(filter(lambda L: L[0][0].isupper() and L[1][0].isupper(), nltk.bigrams(text)))

Edit: As an explanation, list(filter(lambda x: f(x), my_list)) keeps only the elements of my_list for which f(x) is true. Here, I filter the list of bigrams, keeping those in which both words start with an uppercase letter.

(Since each element L of the bigram list is a tuple of two words, I check whether the first character of L[0] and of L[1] is a capital letter.)
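
For illustration, the same filter can be sketched as a list comprehension. This is only a sketch: the helper name capitalized_bigrams and the sample sentence are made up for the example, and zip(tokens, tokens[1:]) stands in for nltk.bigrams so that the snippet runs without NLTK installed.

```python
def capitalized_bigrams(tokens):
    """Return adjacent token pairs in which both tokens start with an uppercase letter."""
    # For a list, zip(tokens, tokens[1:]) yields the same adjacent pairs
    # that nltk.bigrams(tokens) would.
    pairs = zip(tokens, tokens[1:])
    # t[:1] is '' for an empty token and ''.isupper() is False,
    # so empty tokens are skipped instead of raising IndexError.
    return [(a, b) for a, b in pairs if a[:1].isupper() and b[:1].isupper()]


# Hypothetical sample input, split on whitespace for simplicity.
tokens = "I met John Smith and Jane Doe in 2 Paris cafes".split()
print(capitalized_bigrams(tokens))
```

Using isupper() rather than comparing a character with its own upper() matters for tokens such as "2": "2".upper() == "2" is true, so the comparison would let ("2", "Paris") through, while isupper() rejects it.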

Upvotes: 1

Shaida Muhammad

Reputation: 1650

If this is what you want, you can loop over the bigrams and check both words with str.istitle():

import nltk
nltk.download('punkt')

text = "My name is Shaida Muhammad and I'm not an extremist"
text = nltk.word_tokenize(text)
bigrams = list(nltk.bigrams(text)) 


for first_word, second_word in bigrams:
    if first_word.istitle() and second_word.istitle():
        print(first_word, second_word)    # It will output Shaida Muhammad
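
If the goal is a token list in which each name stays together, the same istitle() check can be applied while walking the tokens. The following is a rough sketch, not part of the answer above: merge_name_pairs is a hypothetical helper, and the token list is written out by hand to match what word_tokenize is expected to return for the sentence above.

```python
def merge_name_pairs(tokens):
    """Join each pair of consecutive title-cased tokens into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Look one token ahead; merge only when both tokens are title-cased.
        if i + 1 < len(tokens) and tokens[i].istitle() and tokens[i + 1].istitle():
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2  # skip the token that was just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hand-written tokens (assumed tokenizer output for the example sentence).
tokens = ["My", "name", "is", "Shaida", "Muhammad", "and", "I", "'m",
          "not", "an", "extremist"]
print(merge_name_pairs(tokens))
```

Note that this heuristic also merges a capitalized sentence-initial word with a following name (for example "Yesterday John"), so real text may need an extra check.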

Upvotes: 0

mozway

Reputation: 260600

IIUC, you want to break your sentence into words, but keep the names (two consecutive words starting with a capital letter) together?

You can use a small regex:

import re

text = 'sentence where John Smith and Jane Doe are mentioned, here a Capital word alone'
re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)

output:

['sentence',
 'where',
 'John Smith',
 'and',
 'Jane Doe',
 'are',
 'mentioned',
 'here',
 'a',
 'Capital',
 'word',
 'alone']
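
Applying bigrams to each token (each token is a string, so nltk.bigrams yields pairs of characters rather than pairs of words):

[list(nltk.bigrams(x)) for x in re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)]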

output:

[[('s', 'e'),
  ('e', 'n'),
  ('n', 't'),
  ('t', 'e'),
  ('e', 'n'),
  ('n', 'c'),
  ('c', 'e')],
 [('w', 'h'), ('h', 'e'), ('e', 'r'), ('r', 'e')],
 [('J', 'o'),
  ('o', 'h'),
  ('h', 'n'),
  ('n', ' '),
  (' ', 'S'),
  ('S', 'm'),
  ('m', 'i'),
  ('i', 't'),
  ('t', 'h')],
 [('a', 'n'), ('n', 'd')],
 [('J', 'a'),
  ('a', 'n'),
  ('n', 'e'),
  ('e', ' '),
  (' ', 'D'),
  ('D', 'o'),
  ('o', 'e')],
 [('a', 'r'), ('r', 'e')],
 [('m', 'e'),
  ('e', 'n'),
  ('n', 't'),
  ('t', 'i'),
  ('i', 'o'),
  ('o', 'n'),
  ('n', 'e'),
  ('e', 'd')],
 [('h', 'e'), ('e', 'r'), ('r', 'e')],
 [],
 [('C', 'a'), ('a', 'p'), ('p', 'i'), ('i', 't'), ('t', 'a'), ('a', 'l')],
 [('w', 'o'), ('o', 'r'), ('r', 'd')],
 [('a', 'l'), ('l', 'o'), ('o', 'n'), ('n', 'e')]]
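
To make the difference between character-level and word-level bigrams concrete, here is a small sketch. The pairs helper is a hypothetical stand-in for nltk.bigrams built on zip, used only so the snippet runs without NLTK; the token list is a shortened, hand-written version of the regex output above.

```python
def pairs(seq):
    """Adjacent pairs of any sequence; a stdlib stand-in for nltk.bigrams."""
    return list(zip(seq, seq[1:]))


tokens = ["where", "John Smith", "and", "Jane Doe"]

# A string is a sequence of characters, so this gives character pairs.
print(pairs("and"))

# A list of tokens gives word-level pairs, with each name kept as one unit.
print(pairs(tokens))
```

Passing the whole token list, rather than each token separately, is what produces bigrams over words.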

Upvotes: 0
