Reputation: 2739
I have a database of around 300,000 names and addresses. Many of the names are spelt slightly differently but share the same address, and I've been trying to group such names together. Here's a sample of my data.
POST /_bulk
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SREE SAI MAHILA PODUPU SANGHAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SREE ANJANEYA MAHILA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SREE BANGARAMMA MAHILA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SREE SAI MAHILA PODUPU SANGHAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI SAI MAHILA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI BANGARAMMA MAHILA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI ANJANEYA MAHILA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI RAMA MAHILA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI PYDITHALLAMMA MAHIALA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI RAMA MAHILA PODUPU SANGHAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI PYDIMAMBA MAHILA PODUPU SANGAM KANNAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
{ "index": { "_index": "test", "_type": "test" }}
{ "name":"SRI PYDITHALAMMA MAHILA PODUPU SANGAM", "address":"KSR PURAM", "city":"VIZIANAGARAM" }
I get a very low match score when I try to fuzzy match a name. Here's an example of the query I'm using:
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "SREE BANGARAMMA MAHILA PODUPU SANGAM",
              "fuzziness": 2,
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
When I query this small sample set for SREE BANGARAMMA MAHILA PODUPU SANGAM, I get a max_score of 1.1982819, and the fuzzy-matched document, SRI BANGARAMMA MAHILA PODUPU SANGAM, has a score of 0.2869133. That is only about 24% of the top score, even though the two names differ only in their first word: SRI vs SREE.
Both SRI and SREE show up quite a lot in my dataset. Those could be equated to a title such as Sir. The last part of the query, MAHILA PODUPU SANGAM, also gets repeated a lot throughout my dataset. The only unique entity in the string would be BANGARAMMA.
Would Term Frequency/Inverse Document Frequency (TF-IDF) scoring be the reason for the skewed results?
I do get the result I want when I query this small sample set. But when I run the same query on my main data set of 300,000 records, I only get back the document that matches 100%, and the fuzzy match doesn't show up.
I've tried using boost, but that doesn't seem to yield the result I want either.
I was wondering if this problem is caused by the low fuzzy match score. If the fuzzy match scores so low with just 12 data points in the sample set, it probably scores much lower when it's compared against 300,000. I'd like to know how I could get the fuzzy match to show up when I query my main dataset. Frankly, I don't know what the problem is. Could someone point me in the right direction?
The result of the sample set looks like this:
"hits": {
  "total": 2,
  "max_score": 1.1982819,
  "hits": [
    {
      "_index": "test",
      "_type": "test",
      "_id": "AViGh5xU276qVT8pqAHz",
      "_score": 1.1982819,
      "_source": {
        "name": "SREE BANGARAMMA MAHILA PODUPU SANGAM",
        "address": "KSR PURAM",
        "city": "VIZIANAGARAM"
      }
    },
    {
      "_index": "test",
      "_type": "test",
      "_id": "AViGh5xU276qVT8pqAH2",
      "_score": 0.2869133,
      "_source": {
        "name": "SRI BANGARAMMA MAHILA PODUPU SANGAM",
        "address": "KSR PURAM",
        "city": "VIZIANAGARAM"
      }
    }
  ]
}
Upvotes: 1
Views: 1055
Reputation: 3209
I wouldn't rely on TF-IDF and fuzzy queries to do what you need. Fuzzy queries max out at an edit distance of 2, so "sri" might match "sree" (two edits) but not "shree" (three edits).
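To make the edit-distance limit concrete, here is a minimal Python sketch of plain Levenshtein distance. The function name `edit_distance` is just an illustration, not anything from Elasticsearch. Note one assumption: Elasticsearch by default also counts a swap of two adjacent characters as a single edit, which this sketch does not, but that makes no difference for these particular words.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: inserts, deletes and substitutions cost 1 each."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


for other in ("sree", "shree"):
    print("sri ->", other, edit_distance("sri", other))
```

With a maximum fuzziness of 2, any variant that needs three or more edits on a single term falls outside what a fuzzy query can reach.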
Read up on the SimHash algorithm, a locality-sensitive hash function for strings, meaning that similar strings have hash values that are close to one another.
If you add another field to your source data with a SimHash of the name before you index it, you can then use that value to constrain the range of "similar names" returned for a given address.
You're probably still going to need to do some manual deduplication work to get your list solid, but at least SimHashing names will make this process less painful (e.g. sort by address, then by name hash).
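As a rough illustration of the idea, here is a minimal SimHash sketch in Python. It is one possible variant, not a reference implementation: the choice of character trigrams as features, MD5 as the per-feature hash, and the names `simhash` and `hamming` are all assumptions made for this example. Closeness between two fingerprints is measured here as Hamming distance, i.e. the number of differing bits.

```python
import hashlib


def simhash(text: str, ngram: int = 3, bits: int = 64) -> int:
    """SimHash fingerprint built from character n-grams of the normalized text."""
    # Lowercase and collapse whitespace so trivial differences don't matter.
    normalized = " ".join(text.lower().split())
    grams = [normalized[i:i + ngram]
             for i in range(max(1, len(normalized) - ngram + 1))]
    weights = [0] * bits
    for gram in grams:
        # Stable per-feature hash (Python's built-in hash() is salted per process).
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # The fingerprint keeps the bits where the positive votes won.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


query = simhash("SREE BANGARAMMA MAHILA PODUPU SANGAM")
for name in ("SRI BANGARAMMA MAHILA PODUPU SANGAM",
             "SRI PYDIMAMBA MAHILA PODUPU SANGAM KANNAM"):
    print(name, hamming(query, simhash(name)))
```

The fingerprint would be stored as an extra field at index time. What Hamming distance counts as "similar enough" is something you would have to tune against your own data.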
You may also decide to simply remove honorifics like "sri" from search indexing using a stopword filter. If a term occurs thousands of times in your collection, does it actually help you find people? Does anyone search for "sri" alone?
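A sketch of what such index settings might look like, built as a Python dict so it can be serialized and sent when creating the index. The filter name `honorific_stop`, the analyzer name `name_analyzer` and the list of honorifics are made up for this example; the `name` field would still need to be mapped to use the analyzer, and the mapping syntax depends on your Elasticsearch version.

```python
import json

# Assumed list of honorifics; extend it from your own data.
HONORIFICS = ["sri", "sree", "shree"]

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # A stop token filter that drops the honorifics at analysis time.
                "honorific_stop": {"type": "stop", "stopwords": HONORIFICS}
            },
            "analyzer": {
                "name_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first, so that "SRI" matches the stopword "sri".
                    "filter": ["lowercase", "honorific_stop"],
                }
            },
        }
    }
}

# Request body for creating the index.
print(json.dumps(index_settings, indent=2))
```

If the same analyzer is applied at search time, the honorific is dropped from the query as well, so SRI vs SREE no longer affects matching or scoring.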
I'd also recommend using a common subcontinent nickname/name-variant list (if you can find one) as a synonym list to normalize names (e.g. Hari, Hariram => Hari).*
*If you find/create this list, please share it! Many projects need this!
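The same normalization idea can also be tried outside Elasticsearch, before indexing. Below is a minimal Python sketch that maps known variants to one canonical token and then groups records that share a normalized name and address. The `VARIANTS` table is a tiny hypothetical stand-in for a real name-variant list, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical variant table: each spelling maps to one canonical token.
VARIANTS = {
    "sree": "sri",
    "shree": "sri",
    "sangham": "sangam",
    "hariram": "hari",
}


def normalize_name(name: str) -> str:
    """Lowercase, split on whitespace, and replace known variants."""
    return " ".join(VARIANTS.get(token, token) for token in name.lower().split())


def group_records(records):
    """Group original names by (normalized name, address, city)."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize_name(rec["name"]),
               rec["address"].lower(),
               rec["city"].lower())
        groups[key].append(rec["name"])
    return dict(groups)


sample = [
    {"name": "SREE SAI MAHILA PODUPU SANGHAM", "address": "KSR PURAM", "city": "VIZIANAGARAM"},
    {"name": "SRI SAI MAHILA PODUPU SANGAM", "address": "KSR PURAM", "city": "VIZIANAGARAM"},
    {"name": "SRI RAMA MAHILA PODUPU SANGAM", "address": "KSR PURAM", "city": "VIZIANAGARAM"},
]
for key, names in group_records(sample).items():
    print(key[0], "->", names)
```

This only catches variants that are in the table. Misspellings such as MAHIALA would still need fuzzy matching or manual review.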
Upvotes: 3
Reputation: 1071
Try the query below:
{
  "query": {
    "multi_match": {
      "query": "SREE BANGARAMMA MAHILA PODUPU SANGAM",
      "fuzziness": 2,
      "prefix_length": 1
    }
  }
}
Upvotes: 1