Hemanth Annavarapu

Reputation: 917

Clickstream data analysis

I came across an interesting topic called clickstream data analysis. All I know is what clickstream data is. I would like to learn more about it: the different scenarios in which it can be used in the best interests of a business, and the set of tools needed to process the data at each step of each scenario.

Upvotes: 2

Views: 3893

Answers (3)

Sayali Sonawane

Reputation: 12609

What is Clickstream Data?

Clickstream data is the virtual trail a user leaves behind while surfing the Internet. A clickstream is a record of a user's activity online: every website and every page of every website the user visits, how long the user stays on each page or site, the order in which the pages are visited, any newsgroups the user participates in, and even the addresses of the e-mail the user sends and receives. Both ISPs and individual websites are capable of tracking a user's clickstream.


Clickstream data may include information like: browser height and width, browser name, browser language, device type (desktop, laptop, tablet, mobile), revenue, day, timestamp, IP address, URL, number of products added to the cart, number of products removed, state, country, billing zip code, shipping zip code, etc.
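To make that concrete, here is what a single hit might look like as a record. The field names below are illustrative only (they loosely follow the attributes listed above, not any particular analytics vendor's schema):

```python
# A hypothetical single clickstream "hit" record; field names and
# values are made up for illustration.
hit = {
    "timestamp": "2016-06-24T18:02:31Z",
    "ip_address": "203.0.113.42",
    "url": "https://example.com/products/widget-123",
    "browser_name": "Chrome",
    "browser_language": "en-US",
    "browser_width": 1366,
    "browser_height": 768,
    "device_type": "desktop",
    "country": "US",
    "state": "CA",
    "billing_zip": "94105",
    "shipping_zip": "94107",
    "products_added_to_cart": 2,
    "products_removed_from_cart": 1,
    "revenue": 49.99,
}
```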

How can we extract more information from Clickstream data?

In the web analytics realm, site visitors and potential customers are the equivalent of subjects in a subject-based data set. A subject-based data set is structured in rows and columns (like an Excel spreadsheet): each row is a unique subject and each column is some piece of information about that subject. If you want to do customer-based analysis, you will need a customer-based data set. In its most granular form, clickstream data looks like the chart below, with hits from the same visitor color-coded together.

(image: raw hit-level clickstream data, one row per hit, hits from the same visitor color-coded)

Data scientists derive more features from clickstream data. For each visitor we have several hits within a visit, and over an extended period of time we have a collection of visits. We need a way to organize the data at the visitor level, something like this:

(image: the same data aggregated to one row per visitor)

Obviously, there are many different ways you could aggregate the data. For numeric data like page views, revenue, and video views, we may want to use something like an average or a total. This gives us more information about customer behavior: looking at the aggregated chart, you can easily tell that the company makes more revenue on Fridays.
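As a rough illustration of that roll-up, here is a minimal sketch in Python/pandas. It assumes hit-level data with a `visitor_id` column; the column names and values are made up for the example:

```python
import pandas as pd

# Hit-level clickstream data: one row per hit (illustrative values).
hits = pd.DataFrame({
    "visitor_id":  ["A", "A", "A", "B", "B", "C"],
    "page_views":  [1, 1, 1, 1, 1, 1],
    "revenue":     [0.0, 49.99, 0.0, 0.0, 0.0, 19.99],
    "video_views": [0, 1, 0, 1, 0, 0],
    "day":         ["Fri", "Fri", "Sat", "Fri", "Mon", "Fri"],
})

# Roll the hits up to one row per visitor: totals for numeric columns,
# plus a simple derived feature (share of hits that happened on Friday).
visitors = hits.groupby("visitor_id").agg(
    total_page_views=("page_views", "sum"),
    total_revenue=("revenue", "sum"),
    total_video_views=("video_views", "sum"),
    friday_hit_share=("day", lambda d: (d == "Fri").mean()),
)
print(visitors)
```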

Once you have obtained a customer-based data set, there are a number of statistical models and data science techniques that allow you to do deeper, more meaningful analysis at the visitor level. These methods can be leveraged to:

  • Predict which customers are at the highest risk of churn and determine the factors affecting that risk, allowing you to be proactive in retaining your customer base (a minimal sketch follows this list)

  • Understand the level of brand awareness of individual customers

  • Target customers with individualized, relevant offers

  • Anticipate which customers are most likely to convert and statistically determine how your site is influencing that decision

  • Determine the types of site content that visitors are most likely to respond to and understand how content engagement drives high-value visits

  • Define the profiles and characteristics of the different personas of visitors coming to your site, and understand how to engage with them.
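As a minimal sketch of the first bullet (churn prediction), assuming you have already built a visitor-level feature matrix and a churn label, a simple logistic regression gives both a risk score per visitor and coefficients hinting at which factors drive that risk. All values below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy visitor-level features: [total_page_views, total_revenue,
# days_since_last_visit]. Both the values and the churn labels
# are fabricated for the example.
X = np.array([
    [120, 250.0,  3],
    [  4,   0.0, 90],
    [ 45,  80.0, 10],
    [  2,   0.0, 60],
    [ 90, 150.0,  5],
    [  6,  10.0, 45],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = churned

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)

# Churn risk per held-out visitor, and which features drive it.
print("churn probabilities:", model.predict_proba(X_test)[:, 1])
print("feature coefficients:", model.coef_)
```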

You may also be interested in the following Coursera course:

https://www.coursera.org/learn/process-mining

It's on process mining, which has click trace analysis as a special case, I think.

Upvotes: 12

Ramkumar Venkataraman

Reputation: 868

The following can give a high-level picture of what most companies do:

  1. An ingestion RESTful API for clients to pass in events
  2. Pump the events to Kafka
  3. Spark Streaming for real-time computations (see the sketch after this list)
  4. Gobblin (or similar) to pump data from Kafka to HDFS, then run batch M/R jobs on HDFS
  5. Both the real-time and batch jobs pump the computed metrics to Druid (the Lambda architecture)
  6. A UI for end-user reports/dashboards
  7. Nagios (or similar) for alerting
  8. A metrics-aggregation framework that tracks events through every layer of the stack
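As a minimal sketch of steps 2-3 (Kafka into Spark for a real-time computation), here is what the streaming leg of a POC might look like with PySpark Structured Streaming. The broker address, topic name, and event schema are placeholders, and the console sink stands in for Druid:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-poc").getOrCreate()

# Hypothetical event schema; adjust to whatever your ingestion API emits.
schema = StructType([
    StructField("visitor_id", StringType()),
    StructField("url", StringType()),
    StructField("event_time", TimestampType()),
])

# Read raw events from Kafka (broker and topic are placeholders).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Real-time computation: page views per URL in 1-minute tumbling windows.
page_views = (
    events
    .withWatermark("event_time", "5 minutes")
    .groupBy(window("event_time", "1 minute"), "url")
    .count()
)

# Console sink for the POC; in the pipeline above, the computed metrics
# would instead be written to Druid (and to HDFS for the batch leg).
query = page_views.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```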

From my experience, it is better to start with fairly mature tools and do a POC end to end, then look at other tools you can play around with. For example, as your pipeline matures, you could even have an asynchronous ingestion API (written in Scala/Akka), Kafka Streams to do inline event transformations, Flink for both real-time and batch jobs, etc.

Upvotes: 3

Luis Leal

Reputation: 3534

Maybe you can take a look at the Spark courses on edX; they use clickstream examples with Spark for analysis and machine learning.

Upvotes: -2
