Mysql Regex query to find slightly different duplicates

Question

I have a table, and it has a few columns, like ID, name, etc. There is also a column that holds a JSON object. Some rows got duplicated, due to a bug. I've been trying to write a query that will find all of the duplicates.

Example of JSON:

    {"flowId":"63","nodeId":2,"triggerLogId":"39397","modelId":"146",...}

I want to be able to find a duplicate of the above, where everything is the same, except the "triggerLogId" could be two different numbers.

Example of possible duplicate JSON:

    {"flowId":"63","nodeId":2,"triggerLogId":"56217","modelId":"146",...}

I figured out a few queries to do this if the triggerLogId is the same on both rows, but I can't seem to find any regex that I can use to compare the two columns.

From what I understand, MySQL doesn't have regex back references, so I can't use those. Is this even possible in pure MySQL? The flowId and modelId can be various numbers, so having a list of those wont work.

Searched most of the Stack overflow questions, and tons of google results, hoping someone here will know something I haven't been able to find. :P

Will I end up having to use PHP?

Edit (Table Structure):

id ----- int(11) Auto Increment
type-- varchar(20)
time-- bigint(20) NULL
data-- text

Alvaro Fla&#241;o Larrondo · Accepted Answer

This is not exactly an answer. This will hopefully help someone or yourself to find your answer.

Using this query you can "transform" your data values into different columns.

SELECT id,type,time,flowId,nodeId,triggerLogId,modelId FROM
(
SELECT *,
SUBSTR(data,LOCATE('flowId',data)+LENGTH('flowId')+2, LOCATE(',',data,LOCATE('flowId',data)+LENGTH('flowId')+3) - (LOCATE('flowId',data)+LENGTH('flowId')+2)) as flowId,
SUBSTR(data,LOCATE('nodeId',data)+LENGTH('nodeId')+2, LOCATE(',',data,LOCATE('nodeId',data)+LENGTH('nodeId')+3) - (LOCATE('nodeId',data)+LENGTH('nodeId')+2)) as nodeId,
SUBSTR(data,LOCATE('triggerLogId',data)+LENGTH('triggerLogId')+2, LOCATE(',',data,LOCATE('triggerLogId',data)+LENGTH('triggerLogId')+3) - (LOCATE('triggerLogId',data)+LENGTH('triggerLogId')+2)) as triggerLogId,
SUBSTR(data,LOCATE('modelId',data)+LENGTH('modelId')+2, LOCATE('}',data,LOCATE('modelId',data)+LENGTH('modelId')+3) - (LOCATE('modelId',data)+LENGTH('modelId')+2)) as modelId
FROM `my_table`
)
as foo

I will keep trying to find an answer to your problem and update my answer.

UPDATE

Does this solve your problem?

SELECT CAST(GROUP_CONCAT(id) AS CHAR(1000)) as duplicated_rows
FROM `test`
GROUP BY CONCAT(SUBSTR(data,LOCATE('flowId',data)+LENGTH('flowId')+2, LOCATE(',',data,LOCATE('flowId',data)+LENGTH('flowId')+3) - (LOCATE('flowId',data)+LENGTH('flowId')+2)),
                                SUBSTR(data,LOCATE('nodeId',data)+LENGTH('nodeId')+2, LOCATE(',',data,LOCATE('nodeId',data)+LENGTH('nodeId')+3) - (LOCATE('nodeId',data)+LENGTH('nodeId')+2)),
                                SUBSTR(data,LOCATE('modelId',data)+LENGTH('modelId')+2, LOCATE('}',data,LOCATE('modelId',data)+LENGTH('modelId')+3) - (LOCATE('modelId',data)+LENGTH('modelId')+2)))
HAVING COUNT(*) > 1

DELETE DUPLICATES

I'm not even close to be expert in MySQL so probably (most certainly) this is not even close the best answer. In order to remove the duplicate rows and keep the last row (i.e. leave just the row that has the highest id) we need two steps:

Get the list of ids we are going to remove using this query:

SELECT GROUP_CONCAT(delete_rows) as delete_ids FROM ( SELECT SUBSTR(GROUP_CONCAT(id),1,(LENGTH(GROUP_CONCAT(id)) - LOCATE(',',REVERSE(GROUP_CONCAT(id))))) as delete_rows FROM test GROUP BY CONCAT(SUBSTR(data,LOCATE('flowId',data)+LENGTH('flowId')+2, LOCATE(',',data,LOCATE('flowId',data)+LENGTH('flowId')+3) - (LOCATE('flowId',data)+LENGTH('flowId')+2)), SUBSTR(data,LOCATE('nodeId',data)+LENGTH('nodeId')+2, LOCATE(',',data,LOCATE('nodeId',data)+LENGTH('nodeId')+3) - (LOCATE('nodeId',data)+LENGTH('nodeId')+2)), SUBSTR(data,LOCATE('modelId',data)+LENGTH('modelId')+2, LOCATE('}',data,LOCATE('modelId',data)+LENGTH('modelId')+3) - (LOCATE('modelId',data)+LENGTH('modelId')+2)) ) HAVING COUNT(*) > 1) as foo GROUP BY '';
That query will return a list of ids like (1,5,7,8,10). Copy that list in the next query:

DELETE FROM test WHERE id IN

FINAL UPDATE

I found a way to acoomplish this using just one query.

DELETE FROM test WHERE FIND_IN_SET(id,
(SELECT GROUP_CONCAT(delete_rows) as delete_ids 
FROM ( SELECT SUBSTR(GROUP_CONCAT(id),1,(LENGTH(GROUP_CONCAT(id)) - LOCATE(',',REVERSE(GROUP_CONCAT(id))))) as delete_rows 
                FROM test 
                GROUP BY CONCAT(SUBSTR(data,LOCATE('flowId',data)+LENGTH('flowId')+2, LOCATE(',',data,LOCATE('flowId',data)+LENGTH('flowId')+3) - (LOCATE('flowId',data)+LENGTH('flowId')+2)), SUBSTR(data,LOCATE('nodeId',data)+LENGTH('nodeId')+2, LOCATE(',',data,LOCATE('nodeId',data)+LENGTH('nodeId')+3) - (LOCATE('nodeId',data)+LENGTH('nodeId')+2)), SUBSTR(data,LOCATE('modelId',data)+LENGTH('modelId')+2, LOCATE('}',data,LOCATE('modelId',data)+LENGTH('modelId')+3) - (LOCATE('modelId',data)+LENGTH('modelId')+2)) ) 
                HAVING COUNT(*) > 1) as foo 
GROUP BY ''
));

Mysql Regex query to find slightly different duplicates

Answers (1)

Related Questions