Reputation: 436
I am currently working on a Java web server project that requires natural language processing, specifically Named Entity Recognition (NER).
I was using OpenNLP for Java, since it was easy to add custom training data. It works perfectly.
However, I also need to extract entities inside other entities (nested named entity recognition). I tried doing this in OpenNLP, but I got parsing errors, so my guess is that OpenNLP sadly does not support nested entities.
Here is an example of what I need to parse:
Remind me to [START:reminder] give some presents to [START:contact] John [END] and [START:contact] Charlie [END][END].
If this cannot be achieved with OpenNLP, is there any other Java NLP library that can do this? If there are no Java libraries at all, are there NLP libraries in any other language that can?
Please help. Thanks!
Upvotes: 2
Views: 1065
Reputation: 881
The short answer is:
I think you are stretching the concept of an entity too far. Entities are usually persons, places, organizations, gene names and so on, not complex structures within the text.
For that purpose you need a more elaborate solution that takes the grammatical structure of the sentence into account. You can obtain that structure with a parser such as the one in OpenNLP, and perhaps combine it with the output of the NER process.
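One way to read this suggestion is sketched below in Python: take flat entity spans from an ordinary NER model, take a larger phrase span from a parser, and nest them by containment. The `Span` class, the `nest_spans` helper and the hard-coded spans are all hypothetical; a real pipeline would fill them from the NER and parser output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    start: int   # index of the first token (inclusive)
    end: int     # index after the last token (exclusive)
    label: str
    children: List["Span"] = field(default_factory=list)


def nest_spans(spans):
    """Arrange spans into a forest where each span holds the spans it contains."""
    # Visit outer spans first: sort by start, then by decreasing length.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    roots, stack = [], []
    for span in ordered:
        # Drop candidates from the stack that do not fully contain this span.
        while stack and not (stack[-1].start <= span.start
                             and span.end <= stack[-1].end):
            stack.pop()
        (stack[-1].children if stack else roots).append(span)
        stack.append(span)
    return roots


tokens = "Remind me to give some presents to John and Charlie .".split()

# Hypothetical outputs, written by hand for illustration only.
flat_entities = [Span(7, 8, "contact"), Span(9, 10, "contact")]  # from NER
parser_phrases = [Span(3, 10, "reminder")]                       # from a parser

tree = nest_spans(flat_entities + parser_phrases)
for root in tree:
    print(root.label, tokens[root.start:root.end])
    for child in root.children:
        print("  ", child.label, tokens[child.start:child.end])
```

Spans that only partially overlap are not handled specially here; they simply end up under the nearest span that fully contains them, or at the top level.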
Upvotes: 1
Reputation: 5566
Use this Python 3 script: https://gist.github.com/ttpro1995/cd8c60cfc72416a02713bb93dff9ae6f
It creates multiple un-nested versions of your nested data.
Take the input sentence below (the input must be tokenized first, so there is a space around every word, tag and punctuation mark):
Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> .
It outputs several sentences, one per nesting level:
Remind me to give some presents to John and Charlie .
Remind me to <START:reminder> give some presents to John and Charlie <END> .
Remind me to give some presents to <START:contact> John <END> and <START:contact> Charlie <END> .
Full source code here for quick copy-paste
import sys

END_TAG = 0
START_TAG = 1
NOT_TAG = -1


def detect_tag(in_token):
    """
    Classify a token as a start tag, an end tag, or an ordinary word.
    :param in_token: a single whitespace-delimited token
    :return: START_TAG, END_TAG or NOT_TAG
    """
    if "<START:" in in_token:
        return START_TAG
    elif "<END>" == in_token:
        return END_TAG
    return NOT_TAG


def remove_nest_tag(in_str):
    """
    Split a sentence with nested tags into one sentence per nesting level.
    Example input (Vietnamese):
    với <START:ORGANIZATION> Sở Cảnh sát Phòng cháy , chữa cháy ( PCCC ) và cứu nạn , cứu hộ <START:LOCATION> Hà Nội <END> <END>
    :param in_str: a tokenized sentence with OpenNLP-style tags
    :return: list of sentences; index 0 has no tags, index n keeps only level-n tags
    """
    state = 0          # current nesting depth
    tag_dict = dict()  # token index -> (index, depth, token) for every tag
    sentence_token = in_str.split()

    # Record the nesting depth of every tag token.
    max_nest = 0
    for index, token in enumerate(sentence_token):
        tag = detect_tag(token)
        if tag == START_TAG:
            state += 1
            if max_nest < state:
                max_nest = state
            tag_dict[index] = (index, state, token)
        elif tag == END_TAG:
            # An end tag belongs to the depth it closes.
            tag_dict[index] = (index, state, token)
            state -= 1

    # Generate one sentence per depth, keeping only the tags at that depth.
    generate_sentences = []
    for state in range(max_nest + 1):
        generate_sentence_token = []
        for index, token in enumerate(sentence_token):
            if detect_tag(token) == NOT_TAG:
                generate_sentence_token.append(token)
            elif tag_dict[index][1] == state:
                generate_sentence_token.append(token)
        sentence = ' '.join(generate_sentence_token)
        generate_sentences.append(sentence)
    return generate_sentences


def test():
    tstr2 = "Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> ."
    result = remove_nest_tag(tstr2)
    print("-----")
    for sentence in result:
        print(sentence)


if __name__ == "__main__":
    # Un-nest a dataset for the OpenNLP name finder.
    # test()
    if len(sys.argv) > 1:
        inpath = sys.argv[1]
        # Write the un-nested sentences next to the input file.
        with open(inpath, 'r') as infile, open(inpath + ".out", 'w') as outfile:
            for line in infile:
                sentences = remove_nest_tag(line)
                for sentence in sentences:
                    outfile.write(sentence + "\n")
    else:
        print("usage: python unnest_data.py input.txt")
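The script above writes every level into a single `.out` file. If you would rather train one model per nesting level (my assumption; the gist does not prescribe this), a small variant can group the sentences by level instead. This is a minimal, self-contained sketch; `unnest_by_level` and `corpus` are hypothetical names, not part of the gist.

```python
from collections import defaultdict


def unnest_by_level(tagged_sentence):
    """Return {level: sentence}, keeping only the tags at that nesting depth."""
    tokens = tagged_sentence.split()
    depth = 0
    depths = []  # depth of each tag token, None for ordinary words
    for token in tokens:
        if token.startswith("<START:"):
            depth += 1
            depths.append(depth)
        elif token == "<END>":
            depths.append(depth)  # an end tag belongs to the depth it closes
            depth -= 1
        else:
            depths.append(None)
    max_depth = max((d for d in depths if d is not None), default=0)
    return {
        level: " ".join(t for t, d in zip(tokens, depths)
                        if d is None or d == level)
        for level in range(1, max_depth + 1)
    }


example = ("Remind me to <START:reminder> give some presents to "
           "<START:contact> John <END> and <START:contact> Charlie <END> <END> .")

# Collect the sentences of each level, e.g. to write one training file per level.
corpus = defaultdict(list)
for line in [example]:
    for level, sentence in unnest_by_level(line).items():
        corpus[level].append(sentence)

for level in sorted(corpus):
    print(level, corpus[level])
```

Unlike the original script, this sketch skips the fully untagged level 0 sentence, on the assumption that it adds nothing to a per-level training file.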
Upvotes: 0
Reputation: 109
For Java-based Named Entity Recognition I use the following:
https://github.com/merishav/cleartk-tutorials
You can train models for your own use case; I have already trained NER models for persons, places, dates of birth and professions. ClearTK gives you a wrapper around MalletCRFClassifier.
Upvotes: 1