Calculate visits from pageviews in MySQL

I am recording each page that is viewed by logged in users in a MySQL table. I would like to calculate how may visits the site has had in within time period (eg day, week, month, between 2 dates etc.) in a similar way to Google Analytics.

Google Analytics defines a visit as user activity separated by at least 30 minutes of inactivity. I have the user ID, URL and date/time of each pageview so I need a query that can calculate a visit defined in this way.

I can easily count the pageviews between 2 dates but how can dynamically work out if a pageview from a user is within 30 minutes of another pageview and only count it once?

Here is a small sample of the data:

http://sqlfiddle.com/#!2/56695/2

Many thanks.

Upvotes: 1

Answers (3)

Taurus Silver

Reputation: 177

Do i'm correct that you want count how many user visit the site within 30 minute for login user but only count as one per user event user visit more page in that period of time? If that so you could filter it then group by period of time visit within 30 minute.

First convert integer timestimp into date by using FROM_UNIXTIME, get minute visit, group minute has past, get period of start and end

SELECT DATE_FORMAT(FROM_UNIXTIME(timestamp), '%e %b %Y %H:%i:%s') visit_time, 
FROM_UNIXTIME(timestamp) create_at, 
MINUTE(FROM_UNIXTIME(timestamp)) create_minute,
MINUTE(FROM_UNIXTIME(timestamp))%30 create_minute_has_past_group,
date_format(FROM_UNIXTIME(timestamp) - interval minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_start,
date_format(FROM_UNIXTIME(timestamp) + interval 30-minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_end
FROM uri_history

After that group by period of start and COUNT DISTINCT user

SELECT date_format(FROM_UNIXTIME(timestamp) - interval minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_start,
date_format(FROM_UNIXTIME(timestamp) + interval 30-minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_end,
COUNT(DISTINCT(user)) count
FROM uri_history
GROUP BY period_start
ORDER BY period_start ASC;

I got these from these answer

Upvotes: 0

Booboo

Reputation: 44283

First, I would also index the uri column and make each column "not nullable":

CREATE TABLE IF NOT EXISTS `uri_history` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user` int(10) unsigned NOT NULL, /* cannot be NULL */
  `timestamp` int(10) unsigned NOT NULL, /* cannot be NULL */
  `uri` varchar(255) NOT NULL, /* cannot be NULL */
  PRIMARY KEY (`id`),
  KEY `user` (`user`),
  KEY `timestamp` (`timestamp`),
  KEY `uri`
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

However, I am a bit bewildered by your timestamp column having an int(10) definition and values such as 1389223839. I would expect an integer timestamp to be a value created with the UNIX_TIMESTAMP function call, but 1389223839 would then represent a value of '2014-01-08 18:30:39' for the 'America/New_York' time zone. I would have expected a sample timestamp to be more "contemporary." But I will have to assume that this column is a Unix timestamp value.

Let's say I was interested in gathering statistics for the month of June of this year:

SELECT * FROM uri_history
WHERE DATE(FROM_UNIXTIME(`timestamp`)) between '2022-06-01' and '2022-06-30'
ORDER BY `uri`, `user`, `timestamp`

From this point on I would process the returned rows in sequence recognizing breaks on the uri and user columns. For any returned uri and user combination, it should be very simple to compare the successive timestamp values and see if they differ by at least 30 minutes (i.e. 1800 seconds). In Python this would look like:

current_uri = None
current_user = None
current_timestamp = None
counter = None

# Process each returned row:
for row in returned_rows:
    uri = row['uri']
    user = row['user']
    timestamp = row['timestamp']

    if uri != current_uri:
        # We have a new `uri` column:
        if current_uri:
            # Statistics for previous uri:
            print(f'Visits for uri {current_uri} = {counter}')
        current_uri = uri
        current_user = user
        counter = 1
    elif user != current_user:
        # We have a new user for the current uri:
        current_user = user
        counter += 1
    elif timestamp - current_timestamp >= 1800:
        # New visit is at least 30 minutes after the user's
        # previous visit for this uri:
        counter += 1

    current_timestamp = timestamp

# Output final statistics, if any:
if current_uri:
    print(f'Visits for uri {current_uri} = {counter}

Upvotes: 0

BNazaruk

Reputation: 8111

First, note that doing this kind of analysis in SQL is not the best idea indeed. It just has a very high computational complexity. There are many ways of eliminating the complexity from here.

Since we're talking about the analytics data, or something more akin to access logs of a typical web-server, we could as well just add a cookie value to it, and have a simple piece of front-end code that makes this cookie and gives it a random id, unless the cookie already exists. And sets the expiry of the cookie to whatever you want your session to be, which is 30 minutes by default. Note that you can change your session length in GA. Now your task is as simple as counting unique ids grouped by user. The complexity of N. The favourite complexity of most DBMSes.

Now if you really want to be solving the gaps-and-islands problem, you can just look at classical solutions of the problem, as well as some examples here on SO: SQL Server - Counting Sessions - Gaps and islands

Finally, the 'proper' way of tracking the session id would be generating a random string on every hit and setting it to a certain custom dimension, while having it as a session-level dimension for GA UA. Here's a more detailed explanation.

GA4 is gracious enough to surface the session id more properly, and here is how.

Upvotes: 1

Calculate visits from pageviews in MySQL

Answers (3)

Related Questions