DataAlchemist
DataAlchemist

Reputation: 187

How to partition records of same key in a specific span and count the number of these partitions in sql?

Chrome History SQlite 3 database has the urls table like this:

CREATE TABLE "urls"(
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  url LONGVARCHAR,
  title LONGVARCHAR,
  visit_count INTEGER DEFAULT 0 NOT NULL,
  typed_count INTEGER DEFAULT 0 NOT NULL,
  last_visit_time INTEGER NOT NULL,
  hidden INTEGER DEFAULT 0 NOT NULL
  )

If you also use Chrome and OS X, you can find it at ~/Library/Application Support/Google/Chrome/Default/History. Other system can see this post: Where does Chrome save its SQLite database to?

I want to count the nubmers of title but in separated time span(like a day).

The simplified table is like below:

  id    | title |  last_visit_time  |
--------+-------+-------------------+
"240255"  "foo"  "13197393082183934"
"240256"  "foo"  "13197393082183934"
"240259"  "foo"  "13197393103031880"
"251615"  "foo"  "13201270874907168"
"251616"  "foo"  "13201270904026269"
"251631"  "foo"  "13201271376777156"
"258228"  "goo"  "13203150937196411"
"258229"  "goo‬"  "13203150937196411"
"258230"  "goo‬"  "13203150937196411"

The last_visit_time is "formatted as the number of microseconds since January, 1601". To a given span (like 86400000000, a day in microseconds), all same title only count 1 time, because they are in a same period.

The part of pseudocode to count for a distinct title in Python style is like:

count = 0
all_last_visit_time_sorted = all_last_visit_time.sort() # generally it should be sorted because the `history` data was added into database in time order.
span = 86400000000 # a day in microseconds
current_base_time = all_last_visit_time_sorted[0]
for last_visit_time in all_last_visit_time_sorted:
    if last_visit_time - current_base_time > span:
          count += 1
          current_base_time = last_visit_time

To be more clear, the partition is like:

  id    | title |  last_visit_time  |
--------+-------+-------------------+
"240255"  "foo"  "13197393082183934"
"240256"  "foo"  "13197393082183934"
"240259"  "foo"  "13197393103031880"
------------------------------------
"251615"  "foo"  "13201270874907168"
"251616"  "foo"  "13201270904026269"
"251631"  "foo"  "13201271376777156"
------------------------------------
"258228"  "goo"  "13203150937196411"
"258229"  "goo‬"  "13203150937196411"
"258230"  "goo‬"  "13203150937196411"

Then this is what I expect:

 title |  count |
-------+--------+
 "foo" |    2   |
 "goo" |    1   |

I don't know how to do these with SQL although after google such keywords as window function,subquery and so on.

p.s: data sample

INSERT INTO urls VALUES ("460534","https://stackoverflow.com/questions/ask","Ask a Question - Stack Overflow","1","0","13197393082183934","0");
INSERT INTO urls VALUES ("460535","https://stackoverflow.com/questions/ask/wizard","Ask a Question - Stack Overflow","1","0","13197393082183965","0");
INSERT INTO urls VALUES ("460536","https://stackoverflow.com/questions/ask?guided=true","Ask a Question - Stack Overflow","1","0","13197393082184036","0");
INSERT INTO urls VALUES ("460537","https://stackoverflow.com/search?q=%5Bhadoop%5D","Highest Voted 'hadoop' Questions - Stack Overflow","1","0","13196783835451652","0");
INSERT INTO urls VALUES ("460538","https://stackoverflow.com/questions/tagged/hadoop","Highest Voted 'hadoop' Questions - Stack Overflow","1","0","13196783835452653","0");
INSERT INTO urls VALUES ("460539","https://stackoverflow.com/questions/ask","Ask a Question - Stack Overflow","1","0","13407393082183934","0");
INSERT INTO urls VALUES ("460540","https://stackoverflow.com/questions/ask/wizard","Ask a Question - Stack Overflow","1","0","13407393082184035","0");
INSERT INTO urls VALUES ("460541","https://stackoverflow.com/questions/ask?guided=true","Ask a Question - Stack Overflow","1","0","13407393082184236","0");

Upvotes: 0

Views: 157

Answers (1)

Andrew
Andrew

Reputation: 8758

To me, this is just another flavor of Gaps and Islands. You have to compare each row's last_visit_time to the previous row's. If they are within a day of each other (86400000000), then you want to count them as 1 visit. We'll derive a column to indicate if the current row is within 1 day of the previous row.

Next, you need to take that newly derived column (gap in the dbfiddle), and do a running sum on it. Each time the gap goes to 1, you'll see that running sum go up. So all the rows where the running sum are the same, you'll count them together.

Finally, you can do a count distinct on that grouper column, and you'll get your output.

Here's a DBFiddle (with a few more rows added to your first set of sample data to get a couple more days in there), broken down step by step.

Here's the final (ugly) query:

with fst as (
select
id, title,
last_Visit_time,
lag(last_visit_time,1,last_visit_time) over (partition by title order by last_visit_time) ,
case 
when last_visit_time - lag(last_visit_time,1,last_visit_time) over (partition by title order by last_visit_time)  < 86400000000 
then 0
else 1
end as gap

from
<your table>
order by last_visit_time)


select
 title,
 count (distinct grouper)
 from (
select
id,
title,
gap,
sum(gap) over (partition by title order by last_visit_time rows unbounded preceding) as grouper
from 
fst) t

Upvotes: 1

Related Questions