bigstylee
bigstylee

Reputation: 1240

mysql update with a self referencing query

I have a table of surveys which contains (amongst others) the following columns

survey_id  - unique id
user_id    - the id of the person the survey relates to
created    - datetime
ip_address - of the submission
ip_count   - the number of duplicates

Due to a large record set, its impractical to run this query on the fly, so trying to create an update statement which will periodically store a "cached" result in ip_count.

The purpose of the ip_count is to show the number of duplicate ip_address survey submissions have been recieved for the same user_id with a 12 month period (+/- 6months of created date).

Using the following dataset, this is the expected result.

survey_id   user_id    created    ip_address     ip_count  #counted duplicates survey_id
  1            1      01-Jan-12   123.132.123       1      # 2
  2            1      01-Apr-12   123.132.123       2      # 1, 3
  3            2      01-Jul-12   123.132.123       0      # 
  4            1      01-Aug-12   123.132.123       3      # 2, 6
  6            1      01-Dec-12   123.132.123       1      # 4

This is the closest solution I have come up with so far but this query is failing to take into account the date restriction and struggling to come up with an alternative method.

UPDATE surveys
JOIN(
  SELECT ip_address, created, user_id, COUNT(*) AS total
  FROM surveys  
  WHERE surveys.state IN (1, 3) # survey is marked as completed and confirmed
  GROUP BY ip_address, user_id
) AS ipCount 
  ON (
    ipCount.ip_address = surveys.ip_address
    AND ipCount.user_id = surveys.user_id
    AND ipCount.created BETWEEN (surveys.created - INTERVAL 6 MONTH) AND (surveys.created + INTERVAL 6 MONTH)
  )
SET surveys.ip_count = ipCount.total - 1 # minus 1 as this query will match on its own id.
WHERE surveys.ip_address IS NOT NULL # ignore surveys where we have no ip_address

Thank you for you help in advance :)

Upvotes: 4

Views: 5080

Answers (2)

bigstylee
bigstylee

Reputation: 1240

A few (very) minor tweaks to what is shown above. Thank you again!

UPDATE surveys AS s
INNER JOIN (
  SELECT x, count(*) c
  FROM (
    SELECT s1.id AS x, s2.id AS y
    FROM surveys AS s1, surveys AS s2
    WHERE s1.state IN (1, 3) # completed and verified
      AND s1.id != s2.id # dont self join
      AND s1.ip_address != "" AND s1.ip_address IS NOT NULL # not interested in blank entries
      AND s1.ip_address = s2.ip_address
      AND (s2.created BETWEEN (s1.created - INTERVAL 6 MONTH) AND (s1.created + INTERVAL 6 MONTH))
      AND s1.user_id = s2.user_id # where completed for the same user
  ) AS ipCount
  GROUP BY x
) n on s.id = n.x
SET s.ip_count = n.c

Upvotes: 3

Aditya Mukherji
Aditya Mukherji

Reputation: 9256

I don't have your table with me, so its hard for me to form correct sql that definitely works, but I can take a shot at this, and hopefully be able to help you..

First I would need to take the cartesian product of surveys against itself and filter out the rows I don't want

select s1.survey_id x, s2.survey_id y from surveys s1, surveys s2 where s1.survey_id != s2.survey_id and s1.ip_address = s2.ip_address and (s1.created and s2.created fall 6 months within each other)

The output of this should contain every pair of surveys that match (according to your rules) TWICE (once for each id in the 1st position and once for it to be in the 2nd position)

Then we can do a GROUP BY on the output of this to get a table that basically gives me the correct ip_count for each survey_id

(select x, count(*) c from (select s1.survey_id x, s2.survey_id y from surveys s1, surveys s2 where s1.survey_id != s2.survey_id and s1.ip_address = s2.ip_address and (s1.created and s2.created fall 6 months within each other)) group by x)

So now we have a table mapping each survey_id to its correct ip_count. To update the original table, we need to join that against this and copy the values over

So that should look something like

UPDATE surveys SET s.ip_count = n.c from surveys s inner join (ABOVE QUERY) n on s.survey_id = n.x

There is some pseudo code in there, but I think the general idea should work

I have never had to update a table based on the output of another query myself before.. Tried to guess the right syntax for doing this from this question - How do I UPDATE from a SELECT in SQL Server?

Also if I needed to do something like this for my own work, I wouldn't attempt to do it in a single query.. This would be a pain to maintain and might have memory/performance issues. It would be best have a script traverse the table row by row, update on a single row in a transaction before moving on to the next row. Much slower, but simpler to understand and possibly lighter on your database.

Upvotes: 2

Related Questions