I have a text file I'm processing and I want to tokenize each word, but keep names together, e.g. 'John Smith'.
I want to use nltk.bigrams to do this. If I do and get a list of bigrams, how would I search that list for bigrams where both words start with a capital letter?
bigrams = list(nltk.bigrams(text))
Upvotes: 0
Views: 212
Reputation: 29
list(filter(lambda L: L[0][0].isupper() and L[1][0].isupper(), list(bigrams(text))))
Edit:
As an explanation, list(filter(lambda x: f(x), my_list)) filters my_list, keeping the values for which f(x) == True. Here, I filtered the list list(bigrams(text)), keeping the values for which both words start with an uppercase letter. (Since an element L of list(bigrams(text)) is a tuple of two words, I check whether the first letter of L[0] and of L[1] is a capital letter.)
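As a runnable sketch of the same idea, using a list comprehension and str.isupper (plain split and zip stand in here for NLTK's tokenizer and nltk.bigrams, which produce the same word pairs):

```python
text = "my friend John Smith met Jane Doe yesterday"
words = text.split()                   # stand-in for nltk.word_tokenize
bigrams = list(zip(words, words[1:]))  # same pairs as nltk.bigrams(words)

# Keep only bigrams where both words start with an uppercase letter.
names = [(a, b) for a, b in bigrams if a[0].isupper() and b[0].isupper()]
print(names)  # [('John', 'Smith'), ('Jane', 'Doe')]
```

Note that isupper() is False for digits and punctuation, unlike the c.upper() == c check above, which is usually what you want for names.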
Upvotes: 1
Reputation: 1650
If this is what you want:
import nltk
nltk.download('punkt')
text = "My name is Shaida Muhammad and I'm not an extremist"
text = nltk.word_tokenize(text)
bigrams = list(nltk.bigrams(text))
for first_word, second_word in bigrams:
    if first_word.istitle() and second_word.istitle():
        print(first_word, second_word)  # It will output: Shaida Muhammad
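If the end goal is a token list with the names merged back into single tokens, one possible follow-up (a hypothetical helper, not part of the answer above; note it will also merge a capitalized sentence-start word that happens to precede another capitalized word):

```python
def merge_names(tokens):
    # Join consecutive title-cased tokens into one "First Last" token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i].istitle() and tokens[i + 1].istitle():
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["My", "name", "is", "Shaida", "Muhammad"]
print(merge_names(tokens))  # ['My', 'name', 'is', 'Shaida Muhammad']
```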
Upvotes: 0
Reputation: 260600
IIUC, you want to break your sentence into words, but keep the names (two consecutive words starting with a capital) together?
You can use a small regex:
import re

text = 'sentence where John Smith and Jane Doe are mentioned, here a Capital word alone'
re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)
output:
['sentence',
'where',
'John Smith',
'and',
'Jane Doe',
'are',
'mentioned',
'here',
'a',
'Capital',
'word',
'alone']
[list(nltk.bigrams(x)) for x in re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)]
output:
[[('s', 'e'),
('e', 'n'),
('n', 't'),
('t', 'e'),
('e', 'n'),
('n', 'c'),
('c', 'e')],
[('w', 'h'), ('h', 'e'), ('e', 'r'), ('r', 'e')],
[('J', 'o'),
('o', 'h'),
('h', 'n'),
('n', ' '),
(' ', 'S'),
('S', 'm'),
('m', 'i'),
('i', 't'),
('t', 'h')],
[('a', 'n'), ('n', 'd')],
[('J', 'a'),
('a', 'n'),
('n', 'e'),
('e', ' '),
(' ', 'D'),
('D', 'o'),
('o', 'e')],
[('a', 'r'), ('r', 'e')],
[('m', 'e'),
('e', 'n'),
('n', 't'),
('t', 'i'),
('i', 'o'),
('o', 'n'),
('n', 'e'),
('e', 'd')],
[('h', 'e'), ('e', 'r'), ('r', 'e')],
[],
[('C', 'a'), ('a', 'p'), ('p', 'i'), ('i', 't'), ('t', 'a'), ('a', 'l')],
[('w', 'o'), ('o', 'r'), ('r', 'd')],
[('a', 'l'), ('l', 'o'), ('o', 'n'), ('n', 'e')]]
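Note that nltk.bigrams over a string iterates characters, which is why the output above is character pairs. If word-level bigrams over the regex tokens are what's wanted instead (an assumption about the goal; zip here produces the same pairs as nltk.bigrams), a multi-word name then behaves as a single token:

```python
import re

text = 'sentence where John Smith and Jane Doe are mentioned'
tokens = re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)
pairs = list(zip(tokens, tokens[1:]))  # same pairs as nltk.bigrams(tokens)
print(pairs[:3])  # [('sentence', 'where'), ('where', 'John Smith'), ('John Smith', 'and')]
```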
Upvotes: 0