Ondrej Tokar
Ondrej Tokar

Reputation: 5070

SQL statement takes huge amount of time, is it possible optimize it?

This is my SQL query with which I get all the duplicates, but one(the newest one):

SELECT d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID, d.C_DateCreated 
FROM duplicates as d 
WHERE d.C_DateCreated !=(select max(d2.C_DateCreated) 
FROM duplicates d2 
WHERE d2.C_DataSourceId = d.C_DataSourceId)

Is it possible to optimize it somehow? Unfortunately in 300 000 records it takes +- 40minutes.

Method where the query is:

public ArrayList<Record> get() throws SQLException,
        ClassNotFoundException {
    Statement st = DBConnect.DBC.con.createStatement();
    String sql = ("select d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID, 
d.C_DateCreated "
            + "from duplicates as d "
            + "where d.C_DateCreated !=(select max(d2.C_DateCreated) "
            + "from duplicates d2 where d2.C_DataSourceId = d.C_DataSourceId)");
    ResultSet rs = st.executeQuery(sql);
    DBConnect.DBC.con.commit();
    while (rs.next()) {

        int contactID = rs.getInt("C_ContactID");
        String email = rs.getString("C_EmailAddress");
        String dataSourceID = rs.getString("C_DataSourceID");
        String dateCreated = rs.getString("C_DateCreated");
        duplicate = new Record(contactID, email, dataSourceID, dateCreated);
        duplicates.add(duplicate);
    }
    rs.close();
    st.close();

    return duplicates;
}

Upvotes: 0

Views: 72

Answers (3)

Steve
Steve

Reputation: 1243

In T-SQL the answer would look like this, but I don't think SQLite supports window functions (leaving the answer intact for posterity though):

You can use a window function to label each row with it's position in the group of common C_DataSourceIDs, then with a CTE select the rows that aren't in position 1 like this.

with ordered as (
    select
        d.C_ContactID,
        d.C_EmailAddress,
        d.C_DataSourceID,
        d.C_DateCreated,
        row_number() over (
            partition by
                d.C_DataSourceID
            order by
                d.C_DateCreated
        ) as rownum
    from
        duplicates
) select
    C_ContactID,
    C_EmailAddress,
    C_DataSourceID,
    C_DateCreated
from
    ordered
where
    rownum != 1;

With an index on (C_DataSourceID, C_DateCreated) this will only need a single pass over the table instead of a self join that you have in your query.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269773

You would start by creating an index on duplicates(C_DataSourceId, C_DateCreated):

create index duplicates_DataSourceId_DateCreated on duplicates(C_DataSourceId, C_DateCreated);

If you are using a database that supports window functions, then I would rephrase this as:

SELECT d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID, d.C_DateCreated 
FROM (select d.*, max(C_DateCreated) over (partition by C_DataSourceId) as maxdc
      from duplicates d 
     ) d
WHERE d.C_DateCreated <> maxdc;

It is worth doing the comparison, because sometimes window functions have efficient implementations.

And, if you have the index, a slightly more efficient version of your query is:

SELECT d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID, d.C_DateCreated 
FROM duplicates d
WHERE EXISTS (select 1
              from duplicates 2
              where d2.C_DataSourceId = d.C_DataSourceId and
                    d2.C_DateCreated > d.C_DateCreated
             );

This says to get all rows from duplicates where there is another row (with the same source) that has a bigger date created. The slight advantage is that this doesn't have to look at all the values to get the max(). It only has to find the first one. The major performance improvement will be the composite index.

Upvotes: 2

Pradeep PK
Pradeep PK

Reputation: 61

Create index for the columns C_DateCreated and C_DataSourceId. This will reduce the time to execute the query. Refer this link to know how to create indices. http://www.tutorialspoint.com/mysql/mysql-indexes.htm

Upvotes: 1

Related Questions