infosec
infosec

Reputation: 39

SQL Server inefficient query (server full load)

We are currently scraping structured data from a variety of different sources. Before ingesting new data into our table, we check to see if the data_id exists already.

IF NOT EXISTS (SELECT TOP 1 * FROM TABLE_NAME WHERE DATA_ID=@P0)

We have no indexes; however, we have a PK set for our id column which seems unnecessary, should we remove this to improve insert speed?

Our server is currently at full load checking through 3 million or so worth of data to make sure we are not inserting duplicate data. We have tried upgrading our SQL Server for higher DTU but that doesn't seem to help at all.

When we have multiple jobs running at the same time checking for unique data or SQL Server comes to a crawl and insert speed takes forever.

Should we get rid of this unique data check and create new tables for every scraping job, then use a SQL Query to compare the differences, such as new data or data that was removed?

Query used for conditional insertion:

String sql = "IF NOT EXISTS (SELECT TOP 1 * FROM A_PROV_CVV_LDG_1 WHERE DATA_ID=?) " +
            "INSERT INTO A_PROV_CVV_LDG_1 (DATA_ID, SourceID, BASE_ID, BIN, BANK, CARD_TYPE, CARD_CLASS," +
            " CARD_LEVEL, CARD_EXP, COUNTRY, STATE, CITY, ZIP, DOB, SSN, EMAIL, PHONE, GENDER, ADDR_LINE_1, ADDR_LINE_2," +
            " FIRST_NAME, LAST_NAME, DateAddedToMarket, PRICE) " +
            "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";

This is the entire table definition, no indexes only PK is 'id' which seems unnecessary.

+-------------------+--------------+-----------+ | (PK)id | int | Unchecked | | DATA_ID | int | Checked | | SourceID | int | Checked | | BASE_ID | varchar(255) | Checked | | BIN | varchar(255) | Checked | | BANK | varchar(255) | Checked | | CARD_TYPE | varchar(255) | Checked | | CARD_CLASS | varchar(255) | Checked | | CARD_LEVEL | varchar(255) | Checked | | CARD_EXP | varchar(255) | Checked | | COUNTRY | varchar(255) | Checked | | STATE | varchar(255) | Checked | | CITY | varchar(255) | Checked | | ZIP | varchar(255) | Checked | | DOB | varchar(255) | Checked | | SSN | varchar(255) | Checked | | EMAIL | varchar(255) | Checked | | PHONE | varchar(255) | Checked | | GENDER | varchar(255) | Checked | | ADDR_LINE_1 | varchar(255) | Checked | | ADDR_LINE_2 | varchar(255) | Checked | | FIRST_NAME | varchar(255) | Checked | | LAST_NAME | varchar(255) | Checked | | PRICE | varchar(255) | Checked | | DateAddedToMarket | varchar(255) | Checked | | DateAdded | datetime | Unchecked | +-------------------+--------------+-----------+

Upvotes: 0

Views: 248

Answers (3)

O. Jones
O. Jones

Reputation: 108651

You absolutely need a unique index on DATA_ID for your query--indeed for any deduplication attempt on DATA_ID---to work efficiently. Without it every attempted insert does a full table scan.

Yes, indexes slow down insertion a little bit. But an index on an integer column isn't very expensive. Certainly not compared to the mess you're in now with a table scan for every insertion. Create that index.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269773

This construct:

IF NOT EXISTS (SELECT TOP 1 * FROM A_PROV_CVV_LDG_1 WHERE DATA_ID=?)
    INSERT INTO A_PROV_CVV_LDG_1 . . .

is an anti-pattern. It is attempting to prevent duplicates in code. However, it suffers from race conditions. And you should let the database implement data integrity checks, where it can.

Instead, implement a unique constraint/index to prevent duplicates:

alter table A_PROV_CVV_LDG_1 add constraint unq_A_PROV_CVV_LDG_1_data_id
    unique (data_id);

This does mean that you need to catch an error if you try to insert a duplicate value. That is easy enough in SQL Server using try/catch blocks.

Upvotes: 0

cloudsafe
cloudsafe

Reputation: 2504

If the server is busy, the statement: IF NOT EXISTS (SELECT TOP 1 * FROM TABLE_NAME WHERE DATA_ID=@P0) might be blocked or disk requests are queued. Run sp_who2 to check if there is blocking. If this is the only routine that puts data in the table add WITH (NOLOCK), selecting Null instead of anything unnecessary:

IF NOT EXISTS (SELECT null FROM TABLE_NAME WITH (NOLOCK) WHERE DATA_ID=@P0)

Upvotes: 1

Related Questions