Reputation: 143
I have an UPDATE statement with a subquery in WHERE to find the duplicates. The subquery has errors that is revealed when running the subquery itself but when running that in the UPDATE statement no error is shown and the DML runs OK (but really slow).
See the table setup:
CREATE TABLE RAW_table
(
ERROR_LEVEL NUMBER(3),
RAW_DATA_ROW_ID INTEGER,
ATTRIBUTE_1 VARCHAR2(4000 BYTE)
)
;
INSERT INTO RAW_table VALUES (0, 2, '509NTQD9Q868');
INSERT INTO RAW_table VALUES (0, 2, '509NTQD9Q868');
INSERT INTO RAW_table VALUES (0, 2, '509NTQD9Q868');
INSERT INTO RAW_table VALUES (0, 3, '509NTVS9Q863');
INSERT INTO RAW_table VALUES (0, 3, '509NTVS9Q863');
INSERT INTO RAW_table VALUES (0, 3, '509NTVS9Q863');
COMMIT;
The query with error is:
SELECT UPPER(ATTRIBUTE_1), rid
FROM ( SELECT UPPER(ATTRIBUTE_1)
, ROWID AS rid
, ROW_NUMBER() OVER ( PARTITION BY UPPER (ATTRIBUTE_1) ORDER BY RAW_DATA_ROW_ID) AS RN
FROM RAW_table
)
WHERE RN > 1;
It gives ORA-00904: "ATTRIBUTE_1": invalid identifier
when run.
However, the below DML that uses the above query (as of line 4) in the WHERE statement works OK:
set timing on
UPDATE RAW_table
SET ERROR_LEVEL = 4
WHERE (UPPER (ATTRIBUTE_1), ROWID)
IN (SELECT UPPER (ATTRIBUTE_1), rid
FROM (SELECT UPPER (ATTRIBUTE_1), ROWID AS rid
, ROW_NUMBER() OVER ( PARTITION BY UPPER (ATTRIBUTE_1) ORDER BY RAW_DATA_ROW_ID) AS RN
FROM RAW_table
)
WHERE RN > 1
)
;
4 rows updated.
Elapsed: 00:00:00.36
Why? Why? Why?
I expected the UPDATE to fail with ORA-00904: "ATTRIBUTE_1": invalid identifier
too. Why it does not fail?
The real problem, however, is not that the UPDATE actually works, but that it works really slow.
When I corrected the subquery not to trigger ORA-00904: "ATTRIBUTE_1": invalid identifier
to something like this:
UPDATE RAW_table
SET ERROR_LEVEL = 4
WHERE (UPPER (ATTRIBUTE_1), ROWID)
IN (SELECT checked_column, rid
FROM (SELECT UPPER (ATTRIBUTE_1) AS checked_column, ROWID AS rid
, ROW_NUMBER() OVER ( PARTITION BY UPPER (ATTRIBUTE_1) ORDER BY RAW_DATA_ROW_ID) AS RN
FROM RAW_table
)
WHERE RN > 1
)
;
The query accelerated nearly 400 times on the test data set of 11.000 rows:
SELECT COUNT(*) FROM RAW_table;
COUNT(*)
----------
11004
1 row selected.
Corrected query:
1005 rows updated.
Elapsed: 00:00:00.28
Query with ORA-904:
1005 rows updated.
Elapsed: 00:01:48.40
I was not patient enough to wait till the end of 71.000 rows test:
SELECT COUNT(*) FROM RAW_table;
COUNT(*)
----------
71475
1 row selected.
Corrected query
11004 rows updated.
Elapsed: 00:00:00.60
Query with ORA-904
Cancelled after 30 min...
Explain plan for the query with ORA-904:
UPDATE STATEMENT ALL_ROWS Cost: **2 544 985 615** Bytes: 8 464 752 Cardinality: 4 176
7 UPDATE RAW_TABLE
6 FILTER
1 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 169 282 878 Cardinality: 83 514
5 VIEW Cost: 30 486 Bytes: 2 087 850 Cardinality: 83 514
4 WINDOW SORT Cost: 30 486 Bytes: 169 282 878 Cardinality: 83 514
3 FILTER
2 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 169 282 878 Cardinality: 83 514
Explain plan for corrected query:
UPDATE STATEMENT ALL_ROWS Cost: **36 637** Bytes: 3 374 235 Cardinality: 835
7 UPDATE RAW_TABLE
6 HASH JOIN RIGHT SEMI Cost: 36 637 Bytes: 3 374 235 Cardinality: 835
4 VIEW VIEW SYS.VW_NSO_1 Cost: 30 486 Bytes: 168 197 196 Cardinality: 83 514
3 VIEW Cost: 30 486 Bytes: 169 282 878 Cardinality: 83 514
2 WINDOW SORT Cost: 30 486 Bytes: 169 282 878 Cardinality: 83 514
1 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 169 282 878 Cardinality: 83 514
5 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 169 282 878 Cardinality: 83 514
After analyzing the table the costs are plans are the same. Explain plan for the query with ORA-904:
UPDATE STATEMENT ALL_ROWS Cost: **29 381 690** Bytes: 38 Cardinality: 2
7 UPDATE RAW_TABLE
6 FILTER
1 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 1 358 025 Cardinality: 71 475
5 VIEW Cost: 427 Bytes: 1 786 875 Cardinality: 71 475
4 WINDOW SORT Cost: 427 Bytes: 1 358 025 Cardinality: 71 475
3 FILTER
2 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 1 358 025 Cardinality: 71 475
Explain plan for corrected query:
UPDATE STATEMENT ALL_ROWS Cost: **3 123** Bytes: 1 453 595 Cardinality: 715
7 UPDATE RAW_TABLE
6 HASH JOIN SEMI Cost: 3 123 Bytes: 1 453 595 Cardinality: 715
5 VIEW VIEW SYS.VW_NSO_1 Cost: 427 Bytes: 143 950 650 Cardinality: 71 475
4 VIEW Cost: 427 Bytes: 144 879 825 Cardinality: 71 475
3 WINDOW SORT Cost: 427 Bytes: 1 358 025 Cardinality: 71 475
2 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 1 358 025 Cardinality: 71 475
1 TABLE ACCESS FULL TABLE RAW_TABLE Cost: 54 Bytes: 1 358 025 Cardinality: 71 475
Explain plan cost says it all, but why it is so much different?
I've just triggered a 71.000 rows test again, after computing statistics on the table, but it is already running for few minutes...
This all is on Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit.
Upvotes: 2
Views: 186
Reputation: 59476
Maybe these versions are even faster (at least they are more compact):
UPDATE RAW_table
SET ERROR_LEVEL = 4
WHERE ROWID <>ALL (SELECT MIN(ROWID) FROM RAW_table GROUP BY UPPER(ATTRIBUTE_1));
UPDATE RAW_table
SET ERROR_LEVEL = 4
WHERE ROWID <>ALL (SELECT FIRST_VALUE(ROWID) OVER (PARTITION BY UPPER(ATTRIBUTE_1) ORDER BY RAW_DATA_ROW_ID) FROM RAW_table);
Note, <>ALL
is equivalent to NOT IN
- it's just my personal preference to use <>ALL
.
Upvotes: 0
Reputation: 1269923
Your SELECT
fails because there is no column in the subquery called ATTRIBUTE_1
. You would need to assign the name:
SELECT UPPER(ATTRIBUTE_1), rid
FROM ( SELECT UPPER(ATTRIBUTE_1) as ATTRIBUTE_1,
ROWID AS rid,
ROW_NUMBER() OVER (PARTITION BY UPPER(ATTRIBUTE_1) ORDER BY RAW_DATA_ROW_ID) AS RN
FROM RAW_table
)
WHERE RN > 1;
The UPDATE
does not generate the error because it pulls the value from the outer query:
UPDATE RAW_table
-------^
| SET ERROR_LEVEL = 4
| WHERE (UPPER (ATTRIBUTE_1), ROWID) IN
| (SELECT checked_column, rid
| FROM (SELECT UPPER(ATTRIBUTE_1) AS checked_column, ROWID AS rid,
------------------------------^ This is interpreted as RAW_table.ATTRIBUTE_1
ROW_NUMBER() OVER (PARTITION BY UPPER(ATTRIBUTE_1) ORDER BY RAW_DATA_ROW_ID) AS RN
FROM RAW_table
)
WHERE RN > 1
)
This correlation is probably not what you intend and one reason why I recommend that column names always be qualified (that is, include a table alias).
Upvotes: 3
Reputation: 231671
This is why aliases are really, really useful.
In the query
UPDATE RAW_table
SET ERROR_LEVEL = 4
WHERE (UPPER (ATTRIBUTE_1), ROWID)
IN (SELECT UPPER (ATTRIBUTE_1), rid
FROM (SELECT UPPER (ATTRIBUTE_1), ROWID AS rid
, ROW_NUMBER() OVER ( PARTITION BY UPPER (ATTRIBUTE_1)
ORDER BY RAW_DATA_ROW_ID) AS RN
FROM RAW_table
)
WHERE RN > 1
)
SELECT UPPER (ATTRIBUTE_1)
is valid because it can be resolved as a reference to the table you are updating not to the table in the FROM
. With aliases, that query is equivalent to
UPDATE RAW_table dest
SET dest.ERROR_LEVEL = 4
WHERE (UPPER (dest.ATTRIBUTE_1), ROWID)
IN (SELECT UPPER (dest.ATTRIBUTE_1), src.rid
FROM (SELECT UPPER (rt.ATTRIBUTE_1), rt.ROWID AS rid
, ROW_NUMBER() OVER ( PARTITION BY UPPER (rt.ATTRIBUTE_1)
ORDER BY rt.RAW_DATA_ROW_ID) AS RN
FROM RAW_table rt
) src
WHERE src.rid > 1
)
Of course, if you had written it this way, it would be immediately clear that you were referencing the dest.attribute_1
rather than the src.attribute_1
. That (and many other reasons) is why it's a good idea to alias your columns-- it makes it clear which object you intended to reference and throws an error when the intended reference is invalid rather than potentially resolving it to something you didn't intend.
Upvotes: 2
Reputation: 12169
SELECT UPPER(ATTRIBUTE_1), rid
FROM ( SELECT UPPER(ATTRIBUTE_1) ATTRIBUTE_1
, ROWID AS rid
, ROW_NUMBER() OVER ( PARTITION BY UPPER (ATTRIBUTE_1) ORDER BY RAW_DATA_ROW_ID) AS RN
FROM RAW_table
)
WHERE RN > 1
Upvotes: 0