Zia

Reputation: 416

How to find and match each element of a list in each sentence?

I have a file containing some sentences. I used polyglot for Named Entity Recognition and stored all detected entities in a list. Now I want to check, for each sentence, whether any entity or pair of entities from the list occurs in it, and print those entities.

Here is what I did:

from polyglot.text import Text

file = open('input_raw.txt', 'r')
input_file = file.read()
test = Text(input_file, hint_language_code='fa')

list_entity = []
for sent in test.sentences:
    #print(sent[:10], "\n")
    for entity in test.entities:
       list_entity.append(entity)

for i in range(len(test)):
    m = test.entities[i]
    n = test.words[m.start: m.end] # it shows only word not tag
    if str(n).split('.')[-1] in test: # if each entities exist in each sentence
         print(n)

It gives me an empty list.

Input:

 sentence1: Bill Gate is the founder of Microsoft.
 sentence2: Trump is the president of USA.

Expected output:

Bill Gate, Microsoft
Trump, USA

Output of list_entity:

I-PER(['Trump']), I-LOC(['USA'])

How can I check whether I-PER(['Trump']), I-LOC(['USA']) are in the first sentence?

Upvotes: 0

Views: 604

Answers (1)

lucasgcb

Reputation: 1068

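For starters, your inner loop iterated over test.entities, the entities of the whole text, once per sentence, so the list was filled with the same entities over and over. To get the entities of each sentence, iterate over the entities of that sentence object instead.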

from polyglot.text import Text
with open('input_raw.txt', 'r') as handle:
    input_file = handle.read()
file = Text(input_file, hint_language_code='fa')

list_entity = []
for sentence in file.sentences:
    for entity in sentence.entities:
        #print(entity)
        list_entity.append(entity)

print(list_entity)

Now you don't have an empty list.


As for your problem with identifying the entity terms:

I have not found a way to construct an entity by hand, so the following simply checks whether the sentence has entities containing the same terms. A Chunk can hold several strings, so we iterate over them.

from polyglot.text import Text
with open('input_raw.txt', 'r') as handle:
    input_file = handle.read()
file = Text(input_file, hint_language_code='fa')

def check_sentence(entities_list, sentence):
    ## Return True only if every string term appears as a word
    ## in at least one entity (Chunk) of the sentence.
    for term in entities_list:
        ## Compare the term to each word of each Chunk
        ## object in the sentence and see if there are any matches.
        if not any(any(entityTerm == term for entityTerm in entityObject)
                   for entityObject in sentence.entities):
            return False
    return True

sentence_number = 1 # Index of the sentence to check (0-based, so 1 is the second sentence)
sentence = file.sentences[sentence_number]
entity_terms = ["Bill", 
                "Gates"]

if check_sentence(entity_terms, sentence):
    print("Entity Terms " + str(entity_terms) +  
          " are in the sentence. '" + str(sentence)+ "'")
else:
    print("Sentence '" + str(sentence) +
          "' doesn't contain terms " + str(entity_terms))

Once you find a way to generate arbitrary entities, all you'll have to do is stop unpacking the terms inside the sentence checker and compare the Chunk objects themselves, so that the entity type is compared as well.
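
One way to get a type-aware comparison without building polyglot entities by hand might be to describe each wanted entity as a (tag, words) pair and compare it against the tag and words of each Chunk. The sketch below assumes a Chunk behaves like a list of words with a tag attribute, as the printed output I-PER(['Trump']) suggests. FakeChunk, FakeSentence and sentence_has_entities are hypothetical names, used only so the snippet runs without polyglot or its models.

```python
class FakeChunk(list):
    """Stand-in for a polyglot Chunk: a list of words plus a tag (assumption)."""

    def __init__(self, words, tag):
        super().__init__(words)
        self.tag = tag


class FakeSentence:
    """Stand-in for a polyglot Sentence exposing an `entities` list."""

    def __init__(self, entities):
        self.entities = entities


def sentence_has_entities(wanted, sentence):
    """Return True if every (tag, words) pair in `wanted` matches an entity."""
    return all(
        any(entity.tag == tag and list(entity) == list(words)
            for entity in sentence.entities)
        for tag, words in wanted
    )


# Hand-built sentence mirroring the entities shown in the question.
sentence = FakeSentence([FakeChunk(["Trump"], "I-PER"),
                         FakeChunk(["USA"], "I-LOC")])

print(sentence_has_entities([("I-PER", ["Trump"]), ("I-LOC", ["USA"])], sentence))
print(sentence_has_entities([("I-ORG", ["USA"])], sentence))
```

With real polyglot objects, the same function would presumably be called with one of the sentences from the Text object in place of the stand-in.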


If you just want to match the list of entities in the file against a specific sentence, then this should do the trick:

from polyglot.text import Text
with open('input_raw.txt', 'r') as handle:
    input_file = handle.read()
file = Text(input_file, hint_language_code='fa')

def return_match(entities_list, sentence):
    ## Collect the words of every entity in the sentence that
    ## equals one of the chunks in entities_list.
    matches = []
    for term in entities_list:
        ## Compare each Chunk object of the sentence
        ## to the current Chunk from the list.
        for entity in sentence.entities:
            if entity == term:
                for word in entity:
                    matches.append(word)
    return matches

def return_list_of_entities(file):
    list_entity = []
    for sentence in file.sentences:
        for entity in sentence.entities:
            list_entity.append(entity)
    return list_entity

list_entity = return_list_of_entities(file)
sentence_number = 1 # Index of the sentence to check (0-based, so 1 is the second sentence)
sentence = file.sentences[sentence_number]
match = return_match(list_entity, sentence)

if match:
    print("Entity Term " + str(match) +  
          " is in the sentence. '" + str(sentence)+ "'")
else:
    print("Sentence '" + str(sentence) +
          "' doesn't contain any of the terms " + str(list_entity))

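To get the per-sentence output from the question (one line of entities per sentence), it may be simpler to keep the entities grouped by sentence instead of flattening them into one list. The following is a minimal sketch, assuming each entity can be iterated as a sequence of words. entities_per_sentence is a hypothetical helper, and the stand-in sentences are hand-built to mirror the example input rather than produced by polyglot.

```python
from types import SimpleNamespace


def entities_per_sentence(sentences):
    """Return one string per sentence with its entities joined by ', '."""
    lines = []
    for sentence in sentences:
        # Each entity is treated as a sequence of words (assumption about Chunk).
        names = [" ".join(entity) for entity in sentence.entities]
        lines.append(", ".join(names))
    return lines


# Stand-in sentences mirroring the example input; with polyglot this
# would be the `sentences` of the Text object instead.
sentences = [
    SimpleNamespace(entities=[["Bill", "Gate"], ["Microsoft"]]),
    SimpleNamespace(entities=[["Trump"], ["USA"]]),
]

for line in entities_per_sentence(sentences):
    print(line)
```

Whether polyglot actually detects these particular entities depends on the language model, so the grouping logic is the only part this sketch illustrates.
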
Upvotes: 1
