Reputation: 53
https://github.com/RasaHQ/rasa_nlu/issues/1468#issue-370187480
Rasa NLU version: 0.13.6
Operating system (windows, osx, ...): windows
Content of model configuration file (yml):
language: "en"
pipeline:
- name: tokenizer_whitespace
- name: intent_entity_featurizer_regex
- name: ner_crf
- name: ner_synonyms
- name: intent_featurizer_count_vectors
- name: intent_classifier_tensorflow_embedding
  intent_tokenization_flag: true
  intent_split_symbol: "+"
path: ./models/nlu
data: ./data/training_nlu.json
Issue:
How can I extract an entity whose words are not adjacent? An example is below.
I need to train my NLU to understand public grievances such as STREET LIGHT OUT, POTHOLE IN STREET, STREET LIGHTS ON DAYS.
My entity value is STREET LIGHT OUT, meaning a person wants to report a street light that is not working. He or she will report it in a format like this:
The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 has been fused since a week.
"Street light" alone is not my entity, and neither is "fused" alone; "street light fused" is a synonym of STREET LIGHT OUT. Is it possible to train the NLU to extract "street light fused" from this sentence? If yes, how?
If not, is splitting "Street light" and "fused" into separate entities the only solution? I thought it might be possible to extract "street light fused" from the sentence above, because the NLU can extract entities that contain multiple words and tokenizer_whitespace simply splits on whitespace.
Please suggest whether there is a better way to get my entity without splitting it into multiple entities.
Here are more examples of the same issue:
Example 1:
Garbage not picked from past 10 days,need immediate attention for clearance.
Here I can pick out "Garbage not picked" as the issue. I can train my NLU to extract this named entity with ner_crf using the training snippet below:
{
  "text": "Garbage not picked from past 10 days,need immediate attention for clearance",
  "intent": "inform_grevience",
  "entities": [
    {
      "start": 0,
      "end": 18,
      "value": "Garbage not picked",
      "entity": "issue"
    }
  ]
}
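Entity offsets in this training format are zero-based character positions with an exclusive end, so text[start:end] should equal the value. Counting them by hand is error-prone, so here is a minimal Python sketch for computing them. The helper name entity_span is hypothetical and not part of Rasa NLU; it only handles a value that appears as one contiguous substring.

```python
def entity_span(text, value, entity):
    """Build an entity annotation dict for a contiguous substring of text.

    Hypothetical helper (not a Rasa NLU function). Offsets are zero-based
    and the end is exclusive, so text[start:end] == value.
    """
    start = text.find(value)
    if start == -1:
        # A single start/end pair cannot describe words that are not adjacent.
        raise ValueError(f"{value!r} is not a contiguous substring of the text")
    return {
        "start": start,
        "end": start + len(value),
        "value": value,
        "entity": entity,
    }


text = "Garbage not picked from past 10 days,need immediate attention for clearance"
span = entity_span(text, "Garbage not picked", "issue")
print(span)
```

Because the helper raises when the value is not found as one piece, it also makes visible when a phrase is spread across a sentence and cannot be annotated as a single span.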
Example 2:
A Garbage bin near 10th main is not picked from past 10 days , immediate action required
A different citizen is reporting the same problem with a different sentence.
Can I extract "Garbage not picked" from Example 2 as well using ner_crf?
Upvotes: 2
Views: 1540
Reputation: 2161
I'm going to propose two alternative approaches, both relying on intents. I believe the only entity in the utterance you provided is the address information.
So you can train each of your examples as a completely separate intent (entities excluded for simplicity):
## intent:streetLightOut
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is out.
- I'd like to report a street light that is burnt out
- street light out
## intent:streetLightAlwaysOn
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is always on.
- I'd like to report a street light that never turns off
- street light on constantly
## intent:potholeInStreet
- There's a pothole at the intersection of 10th and main
- pothole
- pothole on 11th street near Wal-Mart
Alternatively, since you are using TensorFlow, you could use hierarchical intents:
## intent:streetLight+out
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is out.
- I'd like to report a street light that is burnt out
- street light out
## intent:streetLight+alwaysOn
- The Street light adjacent to Dr Vasanth Shetty's Clinic , WH Hanumanthappa Layout, Ulsoor Road, Bangalore 42 is always on.
- I'd like to report a street light that never turns off
- street light on constantly
## intent:potHole
- There's a pothole at the intersection of 10th and main
- pothole
- pothole on 11th street near Wal-Mart
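To illustrate the idea behind the "+" in these intent names: with intent_tokenization_flag set to true and intent_split_symbol set to "+", each intent label is treated as a sequence of sub-labels rather than one opaque string. The following is only a rough Python sketch of that splitting, not Rasa's actual implementation, and split_intent is a hypothetical name.

```python
def split_intent(label, split_symbol="+"):
    """Split a hierarchical intent label into its sub-labels.

    Hypothetical illustration only; this is not Rasa NLU code.
    """
    return label.split(split_symbol)


labels = ["streetLight+out", "streetLight+alwaysOn", "potHole"]
for label in labels:
    print(label, "->", split_intent(label))

# The two street light intents share a sub-label, which is what lets
# them be treated as related rather than as unrelated classes.
shared = set(split_intent(labels[0])) & set(split_intent(labels[1]))
print("shared sub-labels:", shared)
```

An intent without the split symbol, such as potHole, simply stays a single sub-label.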
My main reason for suggesting these approaches is that entity extraction in Rasa is highly positional, with little importance placed on the words themselves (and no inclusion of word vectors). Since all problems with street lights are likely to include those words or similar ones, it seems the words themselves hold the most value.
This blog post has some info on TensorFlow and hierarchical intents: https://medium.com/rasa-blog/supervised-word-vectors-from-scratch-in-rasa-nlu-6daf794efcd8
Upvotes: 3