Reputation: 12141
I have a database of items with tags, such as:
item1
is tagged with "pork with apple sauce"
item2
is tagged with "pork"
, item3
is tagged with "apple sauce"
. If I match the string:
"Today I would like to eat pork with apple sauce, it would fill me up"
against the tags, I would get three results. However, I just want to get the most specific one, which in this case would be item1
.
This is just an example and i'm not using a particular database. Just string and map in ruby. I came up with "fuzzy search". I'm not sure if this is correct. Can anybody suggest how to solve this particular problem?
Upvotes: 1
Views: 257
Reputation: 2381
This will apply to general programming and not Ruby in particular.
I would tokenize both Strings, that is both the needle and the haystack and then loop trough them both while counting number of occurens. Then finally compare scores.
Some sudo code:
needle[] = array of tokens from keysentence
haystack[] array of tokens from search string
int score = 0
do {
haystackToken = haystack's next token
do {
needleToken = needle's next token
if (haystackToken equals needleToken)
score += 1
} while(needle has more token)
} while (haystack has more tokens)
Upvotes: 0
Reputation: 3965
Depending on your specific use case, you may not need a fuzzy search.
Maybe a very basic implementation like this is sufficient for you:
class Search
attr_reader :items, :string
def initialize(items, string)
@items = items
@string = string.downcase
end
def best_match
items.max_by { |item| rate(item) }
end
private
def rate(item)
tag_list(item).count { |tag| string.include?(tag) }
end
def tag_list(item)
item[:tags].split(" ")
end
end
items = [
{ id: :item1, tags: "pork with apple sauce" },
{ id: :item2, tags: "pork" },
{ id: :item3, tags: "apple sauce" }
]
string = "Today I would like to eat pork with apple sauce, it would fill me up"
Search.new(items, string).best_match
#=> {:id=>:item1, :tags=>"pork with apple sauce"}
Upvotes: 2
Reputation: 168091
The order or specifity among the items in your database is determined before you match them with a string. You do not make it clear in the question, but I suppose what you have in mind is the length. So, suppose you have the data as a hash:
h = {
item1: "pork with apple sauce",
item2: "pork",
item3: "apple sauce",
}
Then, you can sort this by the length of the tag so that a longer one comes first in the list. At the same time, you can convert the tags into regexes so that you don't need to worry about variation in space. Then, you would have an array like this:
a =
h
.sort_by{|_, s| s.length}.reverse
.map{|k, s| [k, Regexp.new("\\b#{s.gsub(/\s+/, '\\s+')}\\b")]}
# =>
# [
# [
# :item1,
# /\bpork\s+with\s+apple\s+sauce\b/
# ],
# [
# :item3,
# /\bapple\s+sauce\b/
# ],
# [
# :item2,
# /\bpork\b/
# ]
# ]
Once you have this, you just need to find the first item in the list that matches with the string.
s = "Today I would like to eat pork with apple sauce, it would fill me up"
a.find{|_, r| s =~ r}[0]
# => :item1
Upvotes: 1
Reputation: 19219
Yes, you need to do a fuzzy match (aka approximate match). It is quite a well known problem, and implementing an approximate matching algorithm by hand is not an easy task (but I'm sure it's very fun! =D). There are lots of things that can affect how "similar" two strings, A and B, are, depending on what things you consider important, like how many times A appears in B, or how close the order and distance between the words in A appear in B, or if the "important" words in A appear in B, etc.
If you can get by with an existing library, there seems to be a couple of Ruby gems that can get the job done. For example, using this one called fuzzy-string-match, which uses the Jaro-Winkler distance ported from Lucene (a Java library... it also seems to have preserved the Java convention of camelCased method names ¬¬):
require 'fuzzystringmatch'
matcher = FuzzyStringMatch::JaroWinkler.create(:pure)
tags = ["pork with apple sauce", "pork", "apple sauce"]
input = "Today I would like to eat pork with apple sauce, it would fill me up"
# Select the tag by distance to the input string (distance == 1 means perfect
# match)
best_tag = tags.max_by { |tag| matcher.getDistance(tag, input) }
p best_tag
Will correctly select "pork with apple sauce"
.
There's also this other gem called amatch that has many other approximate matching algorithms.
Upvotes: 3