Reputation: 8335
I want to match similar strings that share the same significant word.
Problem:
I have two files: a master file and an input file. I have to iterate through the input file and find the matching record in the master for each input record. Currently I have indexed the master file in ElasticSearch and query it for similar records, but since the master contains many similar records, it returns many hits, and picking the appropriate one from them is the problem.
Sample Input record:
1. H1 Bulbs Included
Sample Output From ElasticSearch:
1. Included H1 [Correct One]
2. H7 Bulbs Included
3. H8 Bulbs Provided
4. H1 not Included [Should not match this]
I have tried using a POS tagger to extract the important terms, but it does not work well.
POS Tagger Output:
1. H1/NNP Included/NNP
2. H8/NNP Bulbs/NNP Provided/NNP
How should I proceed with this?
Edit:
In the above example, H1 is the significant term.
Sample Input Record:
1. H1 Bulbs included
Sample Output from ElasticSearch:
1. H2 Bulbs Included
2. H3 Bulbs Included
3. H1 [Correct One]
First I need to identify the significant word. It currently follows no fixed pattern.
For example (significant word in brackets):
1. H1 bulbs [H1]
2. 9600 added [9600]
3. It has H8 [H8]
4. 1/2 wire for 4500 bulb [4500]
Upvotes: 0
Views: 59
Reputation: 1989
I'm not familiar with Elasticsearch, but doing this in standard Python should be straightforward. From your criteria above, it's not clear which of 'H1', 'Included' and 'Bulbs' are the really significant words, or what the processing criteria are, but as a simple case:
inputstr = 'H1 Bulbs Included'
keywords = ('H1','Bulbs','Included')
result = [x for x in keywords if x in inputstr]
>>> ['H1','Bulbs','Included']
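One caveat with the substring test above: 'H1' in inputstr is also true for a string such as 'H10 Bulbs Included'. A minimal sketch that compares whole tokens instead, assuming words are separated by whitespace and case should be ignored (the helper name matched_keywords is made up for illustration):

```python
def matched_keywords(text, keywords):
    """Return the keywords that occur in text as whole, case-insensitive tokens."""
    # Assumption: tokens are separated by whitespace only.
    tokens = {t.lower() for t in text.split()}
    return [k for k in keywords if k.lower() in tokens]


keywords = ('H1', 'Bulbs', 'Included')
print(matched_keywords('H1 Bulbs Included', keywords))   # ['H1', 'Bulbs', 'Included']
print(matched_keywords('H10 Bulbs Included', keywords))  # ['Bulbs', 'Included']
```

If the records contain punctuation glued to words, the split step would need a proper tokenizer; that is left out here.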
Alternatively, if you want to do some maths on it, you could record one boolean per keyword:
result = [x in inputstr for x in keywords]
>>> [True, True, True]
sum(result)
>>> 3
If some words are critical, multiply the score by their match flags so that a missing critical word zeroes it; if you only need 2 out of 3 words, compare the sum against that threshold, and so on.
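As a rough sketch of that idea, the function below sums a weight for every matched word and returns 0 as soon as a critical word is missing. The weights, the choice of 'H1' as the critical word, and the name weighted_score are illustrative assumptions, not something given in the question:

```python
def weighted_score(text, weights, critical=()):
    """Sum the weights of matched words; a missing critical word gives 0."""
    # Assumption: whitespace tokenization, case-insensitive comparison.
    tokens = {t.lower() for t in text.split()}
    if any(c.lower() not in tokens for c in critical):
        return 0
    return sum(w for word, w in weights.items() if word.lower() in tokens)


# Hypothetical weights: the significant word counts more than the others.
weights = {'H1': 3, 'Bulbs': 1, 'Included': 1}
candidates = ['Included H1', 'H7 Bulbs Included', 'H8 Bulbs Provided']

for cand in candidates:
    print(cand, weighted_score(cand, weights, critical=('H1',)))
# Included H1 4
# H7 Bulbs Included 0
# H8 Bulbs Provided 0
```

Without the critical argument the same candidates score 4, 2 and 1, so a "2 out of 3" style rule becomes a plain threshold on the returned value.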
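To filter out 'not', check that it is absent from the input and multiply that boolean into the score (the parentheses matter, since * binds tighter than not in), i.e.
result = ('not' not in inputstr) * sum(result)
>>> 3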
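Putting the pieces together for the candidate lists in the question, here is a hedged sketch. It assumes the significant word is already known (identifying it is a separate problem), that the negation words are limited to a small hand-written set, and that a candidate should be rejected when it is negated but the query is not, or the other way round. All names (NEGATIONS, best_match) are invented for this example:

```python
NEGATIONS = {'not', 'no', 'without'}  # assumed list, extend as needed


def tokenize(text):
    """Lower-case whitespace tokens as a set."""
    return {t.lower() for t in text.split()}


def best_match(query, candidates, significant):
    """Pick the candidate that has the significant word, the same negation
    status as the query, and the largest token overlap with the query."""
    q_tokens = tokenize(query)
    q_negated = bool(NEGATIONS & q_tokens)
    best, best_overlap = None, -1
    for cand in candidates:
        c_tokens = tokenize(cand)
        if significant.lower() not in c_tokens:
            continue  # significant word missing
        if bool(NEGATIONS & c_tokens) != q_negated:
            continue  # negation status differs from the query
        overlap = len(q_tokens & c_tokens)
        if overlap > best_overlap:
            best, best_overlap = cand, overlap
    return best


hits = ['Included H1', 'H7 Bulbs Included', 'H8 Bulbs Provided', 'H1 not Included']
print(best_match('H1 Bulbs Included', hits, 'H1'))  # Included H1
```

For the second example in the question (hits 'H2 Bulbs Included', 'H3 Bulbs Included', 'H1'), the same call returns 'H1', because the other hits lack the significant word.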
Upvotes: 1