Reputation: 1015
I would like to create a clickstream application using HBase. In SQL this would be a pretty simple task, but in HBase I don't have the first clue. Can someone advise me on a schema design and which row keys to use in HBase?
I have provided a rough data model and several questions that I would like to interrogate the data for.
Questions I would like to answer from the data:
What events led to a conversion? What was the last page viewed, and how many pages were viewed? On which pages do customers drop off? What products do male customers between 20 and 30 like to buy? Is a customer who bought product x also likely to buy product y? What is the conversion amount by first page?
{
PageViews: [
{
date: "19700101 00:00",
domain: "http://foobar.com",
path: "pageOne.html",
timeOnPage: "10",
pageViewNumber: 1,
events: [
{ name: "slideClicked", value: 0, time: "00:00"},
{ name: "conversion", value: 100, time: "00:05"}
],
pageData: {
category: "home",
pageTitle: "Home Page"
}
},
{
date: "19700101 00:01",
domain: "http://foobar.com",
path: "pageTwo.html",
timeOnPage: "20",
pageViewNumber: 2,
events: [
{ name: "addToCart", value: 50.00, time: "00:02"}
],
pageData: {
category: "product",
pageTitle: "Mans Shirt",
itemValue: 50.00
}
},
{
date: "19700101 00:03",
domain: "http://foobar.com",
path: "pageThree.html",
timeOnPage: "30",
pageViewNumber: 3,
events: [],
pageData: {
category: "basket",
pageTitle: "Checkout"
}
}
],
Customer: {
IPAddress: "127.0.0.1",
Browser: "Chrome",
FirstName: "John",
LastName: "Doe",
Email: "[email protected]",
isMobile: 1,
returning: 1,
age: 25,
sex: "Male"
}
}
Upvotes: 0
Views: 604
Reputation: 7138
Well, your data is mainly a one-to-many relationship: one customer and an array of page view entities. Since all your queries are customer-centric, it makes sense to store each customer as a row in HBase and make the customer ID (perhaps the email, in your case) part of the row key.
If you decide to store one row per customer, the details of each page view are stored as nested entities within that row. The video linked below on HBase schema design will help you understand that. So for your example above, you get one row with three nested entities.
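To make the one-row-per-customer idea concrete, here is a minimal sketch in plain Python (no HBase client involved) of how the example could be flattened into a single row. The column family names `c` and `pv`, the qualifier layout, and the ID `customer-0001` are assumptions made up for illustration, not anything HBase prescribes.

```python
def customer_row(customer_id, customer, page_views):
    """Return (row_key, cells) for a one-row-per-customer layout."""
    cells = {}
    # Customer attributes go into a hypothetical "c" column family.
    for field, value in customer.items():
        cells[f"c:{field}"] = str(value)
    # Each page view becomes a group of qualifiers in a hypothetical "pv" family.
    for view in page_views:
        # Zero-padded view number so qualifiers sort in visit order.
        prefix = f"pv:{view['pageViewNumber']:04d}"
        cells[f"{prefix}:date"] = view["date"]
        cells[f"{prefix}:path"] = view["path"]
        cells[f"{prefix}:timeOnPage"] = str(view["timeOnPage"])
    return customer_id, cells


customer = {"Browser": "Chrome", "age": 25, "sex": "Male"}
page_views = [
    {"date": "19700101 00:00", "path": "pageOne.html", "timeOnPage": "10", "pageViewNumber": 1},
    {"date": "19700101 00:01", "path": "pageTwo.html", "timeOnPage": "20", "pageViewNumber": 2},
    {"date": "19700101 00:03", "path": "pageThree.html", "timeOnPage": "30", "pageViewNumber": 3},
]

# "customer-0001" is a hypothetical customer ID used as the row key.
row_key, cells = customer_row("customer-0001", customer, page_views)
for qualifier in sorted(cells):
    print(row_key, qualifier, cells[qualifier])
```

With this layout, reading everything about one customer is a single row lookup, but the row grows with every page view that customer ever makes.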
Another approach is a denormalized form, which lets HBase perform lookups well. Here each row is a page view, and the customer data is appended to every row. So for your example above, you end up with three rows, and the customer data is duplicated. The video covers this as well (see the part on compression).
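As a rough sketch of the denormalized layout, again in plain Python rather than real HBase calls: each page view becomes its own row, and the customer fields are copied into every one. The composite row key format (`customer id|date|view number`), the family names `c` and `pv`, and the ID `customer-0001` are all assumptions for illustration.

```python
def page_view_rows(customer_id, customer, page_views):
    """Return a list of (row_key, cells), one row per page view."""
    rows = []
    for view in page_views:
        # Customer ID first, then timestamp, so one customer's views
        # sort next to each other in time order.
        row_key = f"{customer_id}|{view['date']}|{view['pageViewNumber']:04d}"
        # Customer data is duplicated into every row (the denormalization cost).
        cells = {f"c:{field}": str(value) for field, value in customer.items()}
        cells["pv:path"] = view["path"]
        cells["pv:timeOnPage"] = str(view["timeOnPage"])
        rows.append((row_key, cells))
    return rows


customer = {"Browser": "Chrome", "age": 25, "sex": "Male"}
page_views = [
    {"date": "19700101 00:00", "path": "pageOne.html", "timeOnPage": "10", "pageViewNumber": 1},
    {"date": "19700101 00:01", "path": "pageTwo.html", "timeOnPage": "20", "pageViewNumber": 2},
    {"date": "19700101 00:03", "path": "pageThree.html", "timeOnPage": "30", "pageViewNumber": 3},
]

# "customer-0001" is a hypothetical customer ID.
rows = page_view_rows("customer-0001", customer, page_views)
for row_key, cells in rows:
    print(row_key, cells)
```

Because the key starts with the customer ID, a prefix scan on that ID would return that customer's page views in visit order. Leading with the date instead would favor time-range scans across all customers.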
You have more nested levels inside each page view, such as events and pageData, so denormalization only gets worse. Because everything in HBase is a key-value pair, it is difficult to query and match against these nested levels. Hope this helps you get started.
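To illustrate why the nested levels are awkward, here is a small plain-Python sketch that flattens the `events` array into key-value cells and then answers "which page views recorded a conversion?" by parsing the qualifiers on the client side. The qualifier scheme (`pv:<view>:ev:<index>:<field>`) is a hypothetical convention invented for this example.

```python
def flatten_events(view):
    """Flatten one page view's nested events into qualifier -> value cells."""
    prefix = f"pv:{view['pageViewNumber']:04d}"
    cells = {}
    for index, event in enumerate(view["events"]):
        for field, value in event.items():
            cells[f"{prefix}:ev:{index:02d}:{field}"] = str(value)
    return cells


def pages_with_event(cells, event_name):
    """Client-side filter: page view numbers that recorded the given event."""
    hits = set()
    for qualifier, value in cells.items():
        parts = qualifier.split(":")
        is_event_name = len(parts) == 5 and parts[2] == "ev" and parts[4] == "name"
        if is_event_name and value == event_name:
            hits.add(int(parts[1]))
    return sorted(hits)


page_views = [
    {"pageViewNumber": 1, "events": [
        {"name": "slideClicked", "value": 0, "time": "00:00"},
        {"name": "conversion", "value": 100, "time": "00:05"},
    ]},
    {"pageViewNumber": 2, "events": [
        {"name": "addToCart", "value": 50.00, "time": "00:02"},
    ]},
    {"pageViewNumber": 3, "events": []},
]

all_cells = {}
for view in page_views:
    all_cells.update(flatten_events(view))

print(pages_with_event(all_cells, "conversion"))
```

The structure of each event lives only in the qualifier string, so matching on it means reading the cells and decoding the names yourself, which is the difficulty described above.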
Good video link here
Upvotes: 3