miloops

Reputation: 24

Invalid results when searching emails using elasticsearch with Tire and Ruby on Rails

I'm trying to index and search by email using Tire and elasticsearch.

The problem is that if I search for "[email protected]", I get strange results because of the @ and . symbols. I "solved" it by hacking the query string and adding "email:" before any token I suspect is an email address. If I don't do that, searching for "[email protected]" returns results such as "[email protected]" or "[email protected]".

include Tire::Model::Search
include Tire::Model::Callbacks

settings :analysis => {
  :analyzer => {
    :whole_email => {
      'tokenizer' => 'uax_url_email'
    }
  }
} do
  mapping do
    indexes :id
    indexes :email, :analyzer => 'whole_email', :boost => 10
  end
end

def self.search(params)
  params[:query] = params[:query].split(" ").map { |x| x =~ EMAIL_REGEXP ? "email:#{x}" : x }.join(" ")
  tire.search(load: {:include => {'event' => 'organizer'}}, page: params[:page], per_page: params[:per_page] || 10) do
    query do
      boolean do
        must { string params[:query] } if params[:query].present?
        must { term :event_id, params[:event_id]  } if params[:event_id].present?
      end
    end
    sort do
      by :id, 'desc'
    end
  end
end

def to_indexed_json
  self.to_json
end

When searching with the "email:" prefix, the analyzer works perfectly, but without it the string is searched for in the email field without the specified analyzer, returning lots of undesired results.

Upvotes: 0

Views: 3788

Answers (2)

kiruba

Reputation: 336

Add the field to _all and try searching with an escape character (\) before the special characters in the email address.

Example: something\@example\.com
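A minimal sketch of that escaping step in plain Ruby, in case it helps. The helper name `escape_email_chars` is made up for illustration and is not part of Tire or elasticsearch, and the sketch only produces the escaped string; it does not verify how elasticsearch treats it.

```ruby
# Hypothetical helper (not part of Tire): put a backslash in front of
# every "@" and "." in a search term, as in the example above.
def escape_email_chars(term)
  term.gsub(/[@.]/) { |ch| "\\#{ch}" }
end

puts escape_email_chars("something@example.com")
```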

Upvotes: 2

ramseykhalaf

Reputation: 3400

I think your issue is to do with the _all field. By default, all fields get indexed twice, once under their field name, and again, using a different analyzer, in the _all field.

If you send a query without specifying which field to search, it is executed against the _all field. When you index your doc, the email field's content is indexed again under the _all field (to stop this, set include_in_all: false in your mapping), where it is tokenized the standard way (split on @ and .). This means that unguided queries will give strange results.

The way I would fix this is to use a term query for the emails and make sure to specify the field to search on. A term query is faster because it skips the query parsing step that the query_string query has (that parser is why prefixing the string with "email:" sends it to the right field). Also, you don't need a custom analyzer unless the field contains free text mixed with URLs and emails. If the field only contains emails, just set index: not_analyzed and it will remain a single token. (You might want a custom analyzer that lowercases the email, though.)
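As a rough sketch, the mapping settings mentioned above could look like this when written as a plain Ruby hash. The `'string'` type and the `:properties` wrapper are assumptions about the elasticsearch version in use, and `EMAIL_MAPPING` and `normalize_email` are hypothetical names. Lowercasing in application code is shown as one alternative to a lowercasing analyzer, since a not_analyzed field is stored exactly as sent.

```ruby
# Hypothetical mapping fragment for the email field as a plain hash.
# Assumption: a 'string' field type, as in older elasticsearch versions.
EMAIL_MAPPING = {
  :properties => {
    :email => {
      :type           => 'string',
      :index          => 'not_analyzed', # keep the address as one token
      :include_in_all => false           # do not copy it into _all
    }
  }
}

# Hypothetical helper: normalise an address before indexing or querying,
# because a not_analyzed field is not lowercased by elasticsearch.
def normalize_email(email)
  email.to_s.strip.downcase
end
```

The same normalisation would need to be applied both when the document is indexed and when the term query is built, otherwise the exact-match lookup can miss.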

Make your search query like this:

"term": {
    "email": "[email protected]"
}
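If the search box can contain both emails and free text, one option is to split the input and build the query body yourself. This is only a sketch in plain Ruby that produces the JSON hash. `build_search_query` and `EMAIL_PATTERN` are hypothetical names (the question's EMAIL_REGEXP is not shown, so the pattern here is a deliberately simple stand-in), and the lowercasing assumes the emails were lowercased when indexed.

```ruby
require 'json'

# Hypothetical, deliberately simple pattern; a stand-in for the
# question's EMAIL_REGEXP, which is not shown.
EMAIL_PATTERN = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

# Build a query body: term queries for tokens that look like emails,
# a query_string query for the remaining words, and an optional
# term filter on event_id.
def build_search_query(query, event_id = nil)
  emails, words = query.to_s.split(" ").partition { |t| t =~ EMAIL_PATTERN }

  must = []
  # Term queries are not analyzed, so lowercase here (assumes the
  # indexed values are lowercase too).
  emails.each { |e| must << { :term => { :email => e.downcase } } }
  must << { :query_string => { :query => words.join(" ") } } unless words.empty?
  must << { :term => { :event_id => event_id } } unless event_id.nil?

  # Nothing to search for: fall back to matching every document.
  return { :query => { :match_all => {} } } if must.empty?

  { :query => { :bool => { :must => must } } }
end

puts JSON.pretty_generate(build_search_query("john@example.com conference", 42))
```

The free-text part still goes through query_string, so it only stops matching pieces of email addresses if the email field is also kept out of _all.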

Good luck!

Upvotes: 3
