Reputation: 441
I have a MySQL database in which I store homes. I run essentially just one query against it, but that query needs to be as fast as possible: return all homes within a bounding box of latitude and longitude.
SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???
What is the best way to store my geo data so that this query, returning all homes within the geolocation box, runs as fast as possible?
In case it helps, I've included my database table schema below:
CREATE TABLE IF NOT EXISTS `homes` (
`home_id` int(10) unsigned NOT NULL auto_increment,
`address` varchar(128) collate utf8_unicode_ci NOT NULL,
`city` varchar(64) collate utf8_unicode_ci NOT NULL,
`state` varchar(2) collate utf8_unicode_ci NOT NULL,
`zip` mediumint(8) unsigned NOT NULL,
`price` mediumint(8) unsigned NOT NULL,
`sqft` smallint(5) unsigned NOT NULL,
`year_built` smallint(5) unsigned NOT NULL,
`geolat` decimal(10,6) default NULL,
`geolng` decimal(10,6) default NULL,
PRIMARY KEY (`home_id`),
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`)
) ENGINE=InnoDB ;
UPDATE
I understand spatial extensions factor in the curvature of the earth, but I'm most interested in returning geo data the FASTEST. Unless these spatial database packages somehow return data faster, please don't recommend spatial extensions. Thanks
UPDATE 2
Please note, no one below has truly answered the question. I'm really looking forward to any assistance I might receive. Thanks in advance.
Upvotes: 44
Views: 49419
Reputation: 76590
There is a good paper on MySQL geolocation performance here.
EDIT: Pretty sure this uses a fixed radius. I'm also not 100% certain the algorithm for calculating distance is the most advanced (i.e. it'll "drill" through the Earth).
What's significant is that the algorithm is cheap: it gives you a ballpark limit on the number of rows before the proper distance search.
The algorithm pre-filters by taking candidates in a square around the source point, then calculates the exact distance in miles.
Pre-calculate this, or use a stored procedure as the source suggests:
# Pseudo code
# user_lon and user_lat are the source longitude and latitude
# radius is the search radius in miles; 69 is roughly the number of miles per degree
lon_distance = radius / abs(cos(radians(user_lat)) * 69)
min_lon = user_lon - lon_distance
max_lon = user_lon + lon_distance
min_lat = user_lat - (radius / 69)
max_lat = user_lat + (radius / 69)
SELECT dest.*,
       3956 * 2 * ASIN(SQRT(
           POWER(SIN((user_lat - dest.lat) * PI() / 180 / 2), 2)
           + COS(user_lat * PI() / 180)
           * COS(dest.lat * PI() / 180)
           * POWER(SIN((user_lon - dest.lon) * PI() / 180 / 2), 2)
       )) AS distance
FROM dest
WHERE dest.lon BETWEEN min_lon AND max_lon
  AND dest.lat BETWEEN min_lat AND max_lat
HAVING distance < radius
ORDER BY distance
LIMIT 10
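For reference, the `3956 * 2 * ASIN(...)` expression above is the haversine great-circle formula, with 3956 as the Earth's radius in miles. A small Python sketch of the same computation, handy for sanity-checking what the SQL returns:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, mirroring the SQL expression above."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat1 - lat2) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon1 - lon2) / 2) ** 2)
    return 3956 * 2 * asin(sqrt(a))

# Empire State Building to Times Square: prints the distance, about 0.66 miles
print(haversine_miles(40.7484, -73.9857, 40.7580, -73.9855))
```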
Upvotes: 14
Reputation: 31
Since MySQL 5.7, you can use spatial indexes and functions such as ST_Distance_Sphere() and ST_Contains(), which improve performance.
Upvotes: 1
Reputation: 1
You might consider creating a separate table 'GeoLocations' that has a primary key of ('geolat','geolng') and has a column that holds the home_id if that particular geolocation happens to have a home. This should allow the optimizer to search for a range of geo locations that will be sorted on disk for a list of home_ids. You could then perform a join with your 'homes' table to find information about those home_ids.
CREATE TABLE IF NOT EXISTS `GeoLocations` (
`geolat` decimal(10,6) NOT NULL,
`geolng` decimal(10,6) NOT NULL,
`home_id` int(10) unsigned NULL,
PRIMARY KEY (`geolat`,`geolng`)
);
SELECT GL.home_id
FROM GeoLocations GL
INNER JOIN Homes H
ON GL.home_id = H.home_id
WHERE GL.geolat BETWEEN minLat AND maxLat
AND GL.geolng BETWEEN minLng AND maxLng
Upvotes: 0
Reputation: 99816
I had the same problem and wrote a three-part blog post. It was faster than the geo index.
Upvotes: 6
Reputation: 5402
If you really need performance, you can define bounding boxes for your data: pre-compute a bounding box for each object on insertion and use it later for queries.
If the result sets are reasonably small, you can still do accuracy corrections in the application logic (which is easier to scale horizontally than a database) while serving accurate results.
Take a look at Bret Slatkin's geobox.py which contains great documentation for the approach.
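The approach can be sketched in a few lines of Python (this is only an illustration of the idea, not geobox.py's actual API, and the 0.01-degree cell size is an assumed value, roughly a half-mile at mid-latitudes):

```python
import math

def geobox_id(lat: float, lng: float, resolution: float = 0.01) -> str:
    """Label the grid cell containing (lat, lng) by its south-west corner.

    The label is computed once at insert time and stored in an indexed
    column, so queries become simple equality lookups on the cell ID.
    """
    cell_lat = math.floor(lat / resolution) * resolution
    cell_lng = math.floor(lng / resolution) * resolution
    return f"{cell_lat:.2f}|{cell_lng:.2f}"

# Stored alongside the row at insert time, e.g. in an indexed VARCHAR column:
print(geobox_id(36.12345, -120.54321))  # "36.12|-120.55"
```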
I would still recommend checking out PostgreSQL and PostGIS in comparison to MySQL if you intend to do more complex queries in the foreseeable future.
Upvotes: 2
Reputation: 19
Here's a trick I've used with some success: create round-off regions. That is to say, if you have a location at 36.12345,-120.54321 and you want to group it with other locations that are within a half-mile (approximate) grid box, you can call its region 36.12x-120.54, and all other locations with the same round-off region will fall in the same box.
Obviously, that doesn't get you a clean radius, i.e. the location you're looking at may be closer to one edge of its box than another. However, with this sort of set-up, it's easy enough to calculate the eight boxes that surround your main location's box. To wit:
[36.13x-120.55][36.13x-120.54][36.13x-120.53]
[36.12x-120.55][36.12x-120.54][36.12x-120.53]
[36.11x-120.55][36.11x-120.54][36.11x-120.53]
Pull all the locations with matching round-off labels and then, once you've got them out of the database, you can do your distance calculations to determine which ones to use.
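A quick Python sketch of generating the nine labels (the centre box plus its eight neighbours); the 0.01-degree step and the label format simply follow the example grid above:

```python
def region_labels(lat: float, lng: float, step: float = 0.01) -> list:
    """Return the round-off label of (lat, lng)'s cell plus its 8 neighbours."""
    base_lat = round(lat, 2)
    base_lng = round(lng, 2)
    # North row first, west column first, matching the grid shown above
    return [
        f"{base_lat + dlat * step:.2f}x{base_lng + dlng * step:.2f}"
        for dlat in (1, 0, -1)
        for dlng in (-1, 0, 1)
    ]

# Usable in an indexed IN (...) clause instead of fetching a single box:
labels = region_labels(36.12345, -120.54321)
```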
Upvotes: 1
Reputation: 1551
Sticking with your current approach, there is one change you should make. Rather than indexing geolat and geolng separately, you should have a composite index:
KEY `geolat_geolng` (`geolat`, `geolng`),
Currently your query can take advantage of only one of the two indexes.
Upvotes: 1
Reputation: 1142
Homes? You probably won't even have ten thousand of them. Just use an in-memory spatial index such as an STRtree.
Upvotes: 0
Reputation: 4535
This looks pretty fast. My only concern is that it would use an index to get all the values within 3 miles of the latitude, then filter those for values within 3 miles of the longitude. If I understand how the underlying system works, MySQL generally uses only one index per table, so either the index on lat or the one on long goes unused.
If you had a large amount of data, it might speed things up to give every 1x1-mile square a unique logical ID, add an additional restriction to the SELECT such as (area="23234/34234" OR area="23235/34234" OR ...) for all the squares around your point, and force the database to use that index rather than the lat and long ones. Then you'll be filtering far fewer square miles of data.
Upvotes: 0
Reputation: 10210
The indices you are using are indeed B-tree indices and support the BETWEEN
keyword in your query. This means the optimizer is able to use your indices to find the homes within your "box". It does not, however, mean that it will always use them: if you specify a range that contains too many "hits", the indices will not be used.
Upvotes: 1