Reputation: 5070
This is my SQL query with which I get all the duplicates, but one(the newest one):
SELECT d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID, d.C_DateCreated
FROM duplicates as d
WHERE d.C_DateCreated !=(select max(d2.C_DateCreated)
FROM duplicates d2
WHERE d2.C_DataSourceId = d.C_DataSourceId)
Is it possible to optimize it somehow? Unfortunately in 300 000 records it takes +- 40minutes.
Method where the query is:
public ArrayList<Record> get() throws SQLException,
ClassNotFoundException {
Statement st = DBConnect.DBC.con.createStatement();
String sql = ("select d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID,
d.C_DateCreated "
+ "from duplicates as d "
+ "where d.C_DateCreated !=(select max(d2.C_DateCreated) "
+ "from duplicates d2 where d2.C_DataSourceId = d.C_DataSourceId)");
ResultSet rs = st.executeQuery(sql);
DBConnect.DBC.con.commit();
while (rs.next()) {
int contactID = rs.getInt("C_ContactID");
String email = rs.getString("C_EmailAddress");
String dataSourceID = rs.getString("C_DataSourceID");
String dateCreated = rs.getString("C_DateCreated");
duplicate = new Record(contactID, email, dataSourceID, dateCreated);
duplicates.add(duplicate);
}
rs.close();
st.close();
return duplicates;
}
Upvotes: 0
Views: 72
Reputation: 1243
In T-SQL the answer would look like this, but I don't think SQLite supports window functions (leaving the answer intact for posterity though):
You can use a window function to label each row with it's position in the group of common C_DataSourceIDs, then with a CTE select the rows that aren't in position 1 like this.
with ordered as (
select
d.C_ContactID,
d.C_EmailAddress,
d.C_DataSourceID,
d.C_DateCreated,
row_number() over (
partition by
d.C_DataSourceID
order by
d.C_DateCreated
) as rownum
from
duplicates
) select
C_ContactID,
C_EmailAddress,
C_DataSourceID,
C_DateCreated
from
ordered
where
rownum != 1;
With an index on (C_DataSourceID, C_DateCreated)
this will only need a single pass over the table instead of a self join that you have in your query.
Upvotes: 0
Reputation: 1269773
You would start by creating an index on duplicates(C_DataSourceId, C_DateCreated)
:
create index duplicates_DataSourceId_DateCreated on duplicates(C_DataSourceId, C_DateCreated);
If you are using a database that supports window functions, then I would rephrase this as:
SELECT d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID, d.C_DateCreated
FROM (select d.*, max(C_DateCreated) over (partition by C_DataSourceId) as maxdc
from duplicates d
) d
WHERE d.C_DateCreated <> maxdc;
It is worth doing the comparison, because sometimes window functions have efficient implementations.
And, if you have the index, a slightly more efficient version of your query is:
SELECT d.C_ContactID, d.C_EmailAddress, d.C_DataSourceID, d.C_DateCreated
FROM duplicates d
WHERE EXISTS (select 1
from duplicates 2
where d2.C_DataSourceId = d.C_DataSourceId and
d2.C_DateCreated > d.C_DateCreated
);
This says to get all rows from duplicates where there is another row (with the same source) that has a bigger date created. The slight advantage is that this doesn't have to look at all the values to get the max()
. It only has to find the first one. The major performance improvement will be the composite index.
Upvotes: 2
Reputation: 61
Create index for the columns C_DateCreated and C_DataSourceId. This will reduce the time to execute the query. Refer this link to know how to create indices. http://www.tutorialspoint.com/mysql/mysql-indexes.htm
Upvotes: 1