Reputation: 6208

Performance problems with sorting sql query result

I'm not very good in sql so I have some problems with sorting. I have table which contains geographical coordinates of some locations. I need to find 10 most closest location and show them. So for this I use such sql query :

SELECT 
      6373*2*ATAN2(
      POWER(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2) , 0.5), 
      POWER(1-(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2)),0.5)) AS d
FROM
      (SELECT
             (f.loc_lat)*3.1415/180 AS f_lat, (f.loc_lon)*3.1415/180 AS f_lon,
             (s.loc_lat)*3.1415/180 AS s_lat, (s.loc_lon)*3.1415/180 AS s_lon,
             (s.loc_lat)*3.1415/180 - (f.loc_lat)*3.1415/180 AS lat, 
             (s.loc_lon)*3.1415/180 - (f.loc_lon)*3.1415/180 AS lon
      FROM htl f, htl s 
      WHERE f.htl_cd != s.htl_cd 
      AND f.loc_lat is not null
      AND f.loc_lon is not null
      AND f.ctry_cd = s.ctry_cd
      ) location
)order by d asc;

If I run this query without last line everything is fine. But when I want to sort result I have problem. Sorting is processing to long for me (nearly 30 minutes).

count of query is nearly 28 millions of lines.

So may be I need to rewrite my query or may be you know some hints to speed it up.

UPDATE

data in table htl looks like this:

htl_cd           loc_lat         loc_lon

GDLUA           20.690688        -103.416281

LPLRL           53.463509        -2.876497

BDLWL           41.925992        -72.670218

MKCGM           38.892569        -94.523921

first column is simple primary key. second and third columns it is latitude and longitude (coordinates)

6373*2*ATAN2(
          POWER(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2) , 0.5), 
          POWER(1-(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2)),0.5)) AS d

this formula I use for calculate distance between two locations

UPDATE2 for thous who need execution plan

Execution Plan

Plan hash value: 446529610

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 12M| 733M| | 186K (2)| 00:37: 22 |

| 1 | SORT ORDER BY | | 12M| 733M| 1771M| 186K (2)| 00:37: 22 |

|* 2 | HASH JOIN | | 12M| 733M| | 259 (84)| 00:00: 04 |

|* 3 | TABLE ACCESS FULL| HTL | 5081 | 148K| | 22 (0)| 00:00: 01 |

| 4 | TABLE ACCESS FULL| HTL | 5411 | 158K| | 22 (0)| 00:00: 01 |

--------------------------------------------------------------------------------

Predicate Information (identified by operation id):

   2 - access("F"."CTRY_CD"="S"."CTRY_CD")
       filter("F"."HTL_CD"<>"S"."HTL_CD")
   3 - filter("F"."LOC_LAT" IS NOT NULL)

Statistics

        537  recursive calls
         18  db block gets
        152  consistent gets
      68534  physical reads
          0  redo size
  463630284  bytes sent via SQL*Net to client
    9504847  bytes received via SQL*Net from client
     864044  SQL*Net roundtrips to/from client
          0  sorts (memory)
          1  sorts (disk)
   12960638  rows processed

Upvotes: 2

Answers (3)

MT0

Reputation: 168232

Your query is returning each distance twice. If you have the closest rows a and b then it will return the distance for a to b and for b to a. You can eliminate half the calculations by just comparing the distances in one of the two permutations:

SQL Fiddle

Oracle 11g R2 Schema Setup:

CREATE TABLE htl ( htl_cd, loc_lat, loc_lon, ctry_cd ) AS
          SELECT 'GDLUA', 20.690688, -103.416281, 1 FROM DUAL
UNION ALL SELECT 'LPLRL', 53.463509,   -2.876497, 1 FROM DUAL
UNION ALL SELECT 'BDLWL', 41.925992,  -72.670218, 1 FROM DUAL
UNION ALL SELECT 'MKCGM', 38.892569,  -94.523921, 1 FROM DUAL
/

BEGIN
  FOR i IN 1 .. 20 LOOP
    INSERT INTO htl VALUES(
      'GD' || LPAD( TO_CHAR( i ), 3, '0' ),
      20.690688 + i / 1000,
      -103.416281 + i / 1000,
      1
    );
  END LOOP;
END;
/

Query 1:

WITH indexed_locations AS (
  SELECT h.*,
         ROWNUM AS idx
  FROM   htl h
),
location AS (
  SELECT
          (f.loc_lat)*3.1415/180 AS f_lat, (f.loc_lon)*3.1415/180 AS f_lon,
          (s.loc_lat)*3.1415/180 AS s_lat, (s.loc_lon)*3.1415/180 AS s_lon,
          (s.loc_lat)*3.1415/180 - (f.loc_lat)*3.1415/180 AS lat, 
          (s.loc_lon)*3.1415/180 - (f.loc_lon)*3.1415/180 AS lon
  FROM    indexed_locations f
          INNER JOIN
          indexed_locations s
          ON (
                  f.htl_cd  != s.htl_cd 
              AND f.ctry_cd  = s.ctry_cd
              AND f.idx      < s.idx
             )
  WHERE
          f.loc_lat is not null
  AND     f.loc_lon is not null

),
distances AS (
  SELECT 
        6373*2*ATAN2(
        POWER(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2) , 0.5), 
        POWER(1-(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2)),0.5)) AS d
  FROM  location
  ORDER BY d ASC
)
SELECT d
FROM   distances
WHERE  ROWNUM < 11

Results:

|              D |
|----------------|
| 0.152300994984 |
| 0.152301463917 |
| 0.152301932829 |
| 0.152302401722 |
| 0.152302870595 |
| 0.152303339447 |
|  0.15230380828 |
| 0.152304277093 |
| 0.152304745885 |
| 0.152305214658 |

You can also move all the calculation into a user-defined function and make the function DETERMINISTIC (which will help if the same input parameters are used) - however, beware when profiling this that when it is run once then the results may be cached and subsequent runs may use the cached results so you will not see the true cost of the query every time:

CREATE OR REPLACE FUNCTION calc_Lat_Long_Distance(
  f_loc_lat HTL.LOC_LAT%TYPE,
  f_loc_lon HTL.LOC_LON%TYPE,
  s_loc_lat HTL.LOC_LAT%TYPE,
  s_loc_lon HTL.LOC_LON%TYPE
) RETURN NUMBER DETERMINISTIC
AS
  PI    CONSTANT REAL := 3.1415/180;
  f_lat       REAL := f_loc_lat * PI;
  f_lon       REAL := f_loc_lon * PI;
  s_lat       REAL := s_loc_lat * PI;
  s_lon       REAL := s_loc_lon * PI;
  lat_dif     REAL := s_lat - f_lat;
  lon_dif     REAL := s_lon - f_lon;
  p           REAL := POWER(SIN(lat_dif/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon_dif/2),2);
BEGIN
  RETURN 6373*2*ATAN2( POWER(p, 0.5), POWER(1-p,0.5) );
END;
/

You can also use RESULT_CACHE instead of DETERMINISTIC - profile them and see whether one makes a particular difference.

Query 2:

WITH indexed_locations AS (
  SELECT h.*,
         ROWNUM AS idx
  FROM   htl h
),
distances AS (
  SELECT  calc_Lat_Long_Distance(
            f.loc_lat,
            f.loc_lon,
            s.loc_lat,
            s.loc_lon
          ) AS d
  FROM    indexed_locations f
          INNER JOIN
          indexed_locations s
          ON (
                  f.htl_cd  != s.htl_cd 
              AND f.ctry_cd = s.ctry_cd
              AND f.idx     < s.idx
             )
  WHERE
          f.loc_lat is not null
  AND     f.loc_lon is not null
  ORDER BY d ASC
)
SELECT d
FROM   distances
WHERE  ROWNUM < 11

Results:

|              D |
|----------------|
| 0.152300994984 |
| 0.152301463917 |
| 0.152301932829 |
| 0.152302401722 |
| 0.152302870595 |
| 0.152303339447 |
|  0.15230380828 |
| 0.152304277093 |
| 0.152304745885 |
| 0.152305214658 |

Upvotes: 2

Jumpei

Reputation: 621

SELECT * FROM (
      SELECT 
            6373*2*ATAN2(
            POWER(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2) , 0.5), 
            POWER(1-(POWER(SIN(lat/2),2) + COS(f_lat)*COS(s_lat)*POWER(SIN(lon/2),2)),0.5)) AS d,
            ROW_NUMBER() OVER (ORDER BY d asc) AS rownum
      FROM
            (SELECT
                   (f.loc_lat)*3.1415/180 AS f_lat, (f.loc_lon)*3.1415/180 AS f_lon,
                   (s.loc_lat)*3.1415/180 AS s_lat, (s.loc_lon)*3.1415/180 AS s_lon,
                   (s.loc_lat)*3.1415/180 - (f.loc_lat)*3.1415/180 AS lat, 
                   (s.loc_lon)*3.1415/180 - (f.loc_lon)*3.1415/180 AS lon
            FROM htl f, htl s 
            WHERE f.htl_cd != s.htl_cd 
            AND f.loc_lat is not null
            AND f.loc_lat is not null
            AND f.ctry_cd = s.ctry_cd
            ) location
)WHERE rownum <= 10;

If you need only top 10 records, try above.

Upvotes: 0

David Aldridge

Reputation: 52386

If I run this query without last line everything is fine.

Do you mean that you get the first rows back from the query quickly, or that the entire set of 28 million rows is returned quickly? Your query is pretty computationally intensive.

But when I want to sort result I have problem. Sorting is processing to long for me (nearly 30 minutes).

A sort requires that all values are computed and that they are all ordered before any rows are returned. If all rows are returned from an unsorted query quickly, then it is the sort stage that is taking a long time, and the likely reason is that the sort is spilling to temporary disk. You can monitor the sort space required through the v$sql_workarea_active view while the query is running, and that will tell you how much memory you need to allocate in order to avoid the use of temporary disk space.

You can also get an accurate breakout of timings of each stage in the execution plan by using the gather_plan_statistics hint and DBMS_XPlan. http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/d_xplan.htm#ARPLS70137

Finally, do you actually need all 28 million rows to be sorted? Do you need to know which points are the furthest apart, or just the ones that are closest together? If the latter, perhaps you can consider filtering the result set prior to the ORDER BY.

Upvotes: 2

Performance problems with sorting sql query result

Execution Plan

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------

Predicate Information (identified by operation id):

Statistics

Answers (3)

Related Questions