Mark Grey

Reputation: 10257

Sphinx Search - Multi-index search vs Client Program Aggregation

Looking for insight into the best approach to implementing a Python client for Sphinx Search.

The dataset I am searching through consists of profile content. All profiles are organized geographically as locations with a latitude and longitude. The profiles have many different attributes, all stored in the database as TEXT associated with the right profile ID. From a search standpoint, the query procedure is to issue a geographic search that uses the Haversine formula to find all ids that fall within a radius, and then use Sphinx to search through those profiles' properties to find the ones whose published content is most relevant to the issued query.

The client for Sphinx I've been working on so far uses several different Sphinx indexes and runs separate queries. The Python object first runs the location query and saves the ids that fall within range, then runs queries against all the other indexes, filtering so that only ids from the geographic set can be returned as valid results.

What I am wondering is whether it would be more efficient to join the location data into the fulltext search index and have Sphinx handle all the querying, rather than structuring my client program to "fall back" through the queries like this. Would there be any advantage to one large index that gathers all the data into one Sphinx "document", rather than having the client be responsible for running additional queries and filtering?

Code posted below to give an idea of how the queries run:

import math
from sphinxapi import *  # SphinxClient, SPH_MATCH_ALL, SPH_SORT_EXTENDED, ...

def LocationQuery(self):
    # Geo pass: collect the ids of all profiles within self._radius meters.
    self.SetServer('127.0.0.1', 9312)
    self.SetMatchMode(SPH_MATCH_ALL)
    self.SetGeoAnchor('latitude', 'longitude',
                      math.radians(self._lat), math.radians(self._lon))
    self.SetLimits(0, 1000)
    self.SetFilterFloatRange('@geodist', 0.0, self._radius)
    self.SetSortMode(SPH_SORT_EXTENDED, '@geodist asc')
    self._results = self.Query('loc', GEO_INDEX)  # 'loc' is a placeholder term
    for match in self._results['matches']:
        self._ids_in_range.append(
            ProfileResult(match['id'], match['attrs']['@geodist']))

def DescriptionQuery(self):
    # Fulltext pass: search descriptions, restricted to the geo result set.
    self.ResetFilters()
    self.SetSortMode(SPH_SORT_EXTENDED, 'profileid_attr asc')
    ids = [obj.profID for obj in self._ids_in_range]
    self.SetFilter('profileid_attr', ids)
    self._results = self.Query(self._query, DESCRIPTION_INDEX)
    # Intersect fulltext matches with the geo set via a dict lookup.
    in_range = dict((obj.profID, obj) for obj in self._ids_in_range)
    for match in self._results['matches']:
        if match['id'] in in_range:
            self.ResultSet.append(in_range[match['id']])
    print 'Description Results: %s' % len(self._results['matches'])
    print 'Total Results: %s' % len(self.ResultSet)

These methods are run in sequence, saving the ids that are found onto the object, roughly like this:
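(The wrapper class name and constructor arguments below are simplified placeholders, not the exact code.)

# Hypothetical driver; ProfileSearch is assumed to subclass
# sphinxapi.SphinxClient and to store _lat, _lon, _radius and _query.
search = ProfileSearch(lat=40.7128, lon=-74.0060,
                       radius=5000.0, query='guitar lessons')
search.LocationQuery()     # first pass: populates search._ids_in_range
search.DescriptionQuery()  # second pass: fulltext, filtered to those ids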

Upvotes: 0

Views: 665

Answers (1)

Iaroslav Vorozhko

Reputation: 1719

If I understand you correctly, it could work faster if you extend your DESCRIPTION_INDEX with latitude and longitude attributes. Instead of two queries you would have only one, against the description index.
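A minimal sketch of that single combined query, assuming DESCRIPTION_INDEX has been rebuilt with latitude and longitude declared as float attributes (for example via sql_attr_float in sphinx.conf); the method and attribute names mirror the question's code and are otherwise assumptions:

def CombinedQuery(self):
    # One round trip: geo-filter and fulltext-match against the same index.
    self.SetServer('127.0.0.1', 9312)
    self.SetMatchMode(SPH_MATCH_ALL)
    self.SetGeoAnchor('latitude', 'longitude',
                      math.radians(self._lat), math.radians(self._lon))
    # Keep only documents within the radius, then rank by relevance.
    self.SetFilterFloatRange('@geodist', 0.0, self._radius)
    self.SetSortMode(SPH_SORT_EXTENDED, '@weight desc, @geodist asc')
    self.SetLimits(0, 1000)
    self._results = self.Query(self._query, DESCRIPTION_INDEX)
    for match in self._results['matches']:
        self.ResultSet.append(
            ProfileResult(match['id'], match['attrs']['@geodist']))

With the geo anchor and the fulltext query on the same index, Sphinx applies the radius filter and the relevance ranking in one pass, so the client no longer has to intersect two id sets itself.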

Upvotes: 0
