Gerdofal
Gerdofal

Reputation: 83

Is there any way to optimize a complex math query in MySQL?

I've been researching this forever and I am starting to think I can't make this any more efficient, but I wanted to ask if anyone had any tips.

I'm running a query on millions of records to find all the x,y,z coordinates (these are stars) along a linear column from system a to system b with a given radius. I'm running through PHP with a lot of other work being done on the result set. I get results from the script in about 16 seconds. The query delay is about 7 of those 16 seconds.

The basic query logic is:

SELECT name, coordinates, and distance from end point
FROM stars
WHERE all stars are in a column of given radius between start and end points
ORDER BY distance from end point DESC

The where clause requires two separate calculations, they are this:

Where calculation 1:

Calculate if the stars are within the space of the column using constants and x,y,z

Where calculation 2:

Limit the column radius to a given figure.
(This where clause also performs similar calculations with the same constants and x,y,z.)

The math formulas in the where clauses can't really be changed, they are the formula needed for columnar calculation in 3D space.

The order by at the end of the query is absolutely necessary because the result set is too large for my script to hold in memory. I have to work with it in the proper order in the script.

The query is easiest to read as defined prior to variable substitution:

SELECT
    name,
    x,
    y,
    z,
    SQRT(
        pow(`x`-" . $bx . ",2)+
        pow(`y`-" . $by . ",2)+
        pow(`z`-" . $bz . ",2)
    ) d
FROM
    stars
WHERE
    (((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) between 0 and 1
AND
    SQRT(((($ax + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cx))-`x`)*(($ax + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cx))-`x`))+((($ay + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cy))-`y`)*(($ay + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cy))-`y`))+((($az + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cz))-`z`)*(($az + ((((`x`*$cx+`y`*$cy+`z`*$cz)-($constant_1))/($constant_2)) * $cz))-`z`)))
        <=$radius
ORDER BY
    SQRT(
        pow(`x`-" . $bx . ",2)+
        pow(`y`-" . $by . ",2)+
        pow(`z`-" . $bz . ",2)
    ) DESC

The final query run on the database looks like this: (For simplicity, I'm using sample data where a lot of the constants are 0.)

SELECT
    name, 
    x, 
    y, 
    z, 
    SQRT( pow(`x`-25.21875,2)+ pow(`y`--20.90625,2)+ pow(`z`-25899.96875,2) ) d
FROM
    stars
WHERE
    (((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) 
    between 0 and 1
AND
    SQRT((((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25.21875))-`x`)*((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25.21875))-`x`))+(((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * -20.90625))-`y`)*((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * -20.90625))-`y`))+(((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25899.96875))-`z`)*((0 + ((((`x`*25.21875+`y`*-20.90625+`z`*25899.96875)-(0))/(670809454.308)) * 25899.96875))-`z`)))
    <=600
ORDER BY
    SQRT( pow(`x`-25.21875,2)+ pow(`y`--20.90625,2)+ pow(`z`-25899.96875,2) ) DESC

My table definition looks like this:

CREATE TABLE IF NOT EXISTS `stars` (
    `localkey` bigint(20) NOT NULL AUTO_INCREMENT,
    `id` bigint(20) NOT NULL,
    `x` double NOT NULL,
    `y` double NOT NULL,
    `z` double NOT NULL,
    `name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`localkey`),
UNIQUE KEY `id` (`id`),
KEY `x` (`x`),
KEY `y` (`y`),
KEY `z` (`z`),
KEY `xyz` (`x`,`y`,`z`),
KEY `name` (`name`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 

The explain for the query results indicates no index usage and an extra of:

extra: Using where; Using filesort;

What I've tried so far:

Are there other options I'm missing?

Thanks!

Upvotes: 0

Views: 371

Answers (4)

Shaunak
Shaunak

Reputation: 18018

Along with the two great mathematical optimizations suggested, the biggest jump in improving the speed would be form minimizing any computation and reducing your search space. That means Spatial Indexing.

I am not an expert in MySQL, but idea is you pre-generate spatial indexes in 3D space so your search space is dramatically reduced.

To be more precise, with a full table scan, your complexity becomes O(n^2). The time required to search increases with the size of table, and how far down the table you search. However with tree based spatial index it can be reduced to O(n log n)

Think about this dividing the space into fixed size cubes (and cubes within cubes). Not unlike how google map arranges tiles. Now with the indexing, you have a "wormhole" to each cube, based on the initial coordinates, as you can calculate the cube you can find a star in with O(n) time. Then all you have to do is run search in this cube.

Here's some reference in MySQL docs on spatial indexes.

I faced a similar problem dealing with LiDAR data in two coordinates few years back. Here's the link to my question and answers that may help you get some ideas:

https://gis.stackexchange.com/questions/12030/optimize-nearest-neighbor-query-on-70-million-point-cloud-on-sql-server-2008

Upvotes: 1

Rick James
Rick James

Reputation: 142278

The bounding-box approach might even use an index on one of the dimensions. But not if written the way Marc suggests. Instead:

`x` BETWEEN $x - $dist AND $x + $dist

The general principle is that you should not hide an indexed variable in a function. ABS in this example.

Also...

ORDER BY d -- this will avoid recomputing the SQRT

Does double minus in pow(y--20.90625,2) really work? To fix it, swap them:

pow(-20.90625 - `y`,2)

It is messier to set up, but multiply might be faster than POW:

(-20.90625 - `y`) * (-20.90625 - `y`)

Upvotes: 0

roderick young
roderick young

Reputation: 304

In addition to the excellent suggestion of fast pre-filtering that Marc B suggested above, when you do your second pass, you might save a little computing in the distance formula in two ways:

1) use (x-k) * (x-k) instead of calling pow(x-k, ...)

2) skip the square root and calculate distance squared. You would then compare to square of the distance that you need, which only needs to be calculated once.

Upvotes: 1

Marc B
Marc B

Reputation: 360632

Assuming that only a smaller subset of your records will match, you can reduce the math load by doing a basic "rectangular" filtering first. e.g. there's no point in performing a full cartesian distance for EVERY record in the table, only to throw away most of them.

A simple "box" boundary check is just a simple subtraction and comparison:

SELECT ...
FROM (
    SELECT ...
    WHERE (
        (abs($x_coord - x_coordinate) <= $max_distance)
     OR (abs($y_coord - y_coordinate) <= $max_distance)
    )
) AS square_filter
WHERE ... full calculation here

Of cousre, you're doing 3d positions, so it's a bit more complicated, but this should give you the basic idea.

Upvotes: 3

Related Questions