Reputation: 15
This is a follow up to my only other question, but hopefully more direct. I need data that looks like this:
custID custChannel custDate
1 151 Direct 2015-10-10 00:15:32
2 151 GooglePaid 2015-10-10 00:16:45
3 151 Converted 2015-10-10 00:17:01
4 5655 BingPaid 2015-10-11 00:20:12
5 7855 GoogleOrganic 2015-10-12 00:05:32
6 7862 YahooOrganic 2015-10-13 00:18:20
7 9655 GooglePaid 2015-10-13 00:08:35
8 9655 GooglePaid 2015-10-13 00:11:11
9 9655 Converted 2015-10-13 00:11:35
10 9888 GooglePaid 2015-10-14 00:08:35
11 9888 GooglePaid 2015-10-14 00:11:11
12 9888 Converted 2015-10-14 00:11:35
To be sorted so that the output looks like this:
Path Path Count
BingPaid 1
Direct>GooglePaid>Converted 1
GoogleOrganic 1
GooglePaid>GooglePaid>Converted 2
YahooOrganic 1
The idea is to capture customer paths (as identified by custID) and count for the entire data set how many people took that exact path (Path Count). I need to perform this over a data set of 5 million rows.
Upvotes: 1
Views: 38
Reputation: 21497
Using data.table
you can do this as follows:
require(data.table)
setDT(dat)[,paste(custChannel, collapse = ">"), custID][,.("path length"=.N), .(path=V1)]
Result:
path path length
1: Direct>GooglePaid>Converted 1
2: BingPaid 1
3: GoogleOrganic 1
4: YahooOrganic 1
5: GooglePaid>GooglePaid>Converted 2
Step by step:
setDT(dat) # make dat a data.table
# get path by custID
dat_path <- dat[,paste(custChannel, collapse = ">"), custID]
#get length by path created in the previous step
res <- dat_path[,.("path length"=.N), by=.(path=V1)]
Have a look at dat_path
and res
to understand what happened.
Upvotes: 2