Reputation: 73
I am trying to count the number of times users look at pages in the same session.
I am starting with a data frame listing user_ids and the page slugs they have visited:
user_id page_view_page_slug
1 slug1
1 slug2
1 slug3
1 slug4
2 slug5
2 slug3
2 slug2
2 slug1
What I am looking to get is a pivot table counting user_ids at the cross-section of each pair of slugs:
. | slug1 | slug2 | slug3 | slug4 | slug5 |
---|---|---|---|---|---|
slug1 | 2 | 2 | 2 | 1 | 1 |
slug2 | 2 | 2 | 2 | 1 | 1 |
slug3 | 2 | 2 | 2 | 1 | 1 |
slug4 | 1 | 1 | 1 | 1 | 0 |
slug5 | 1 | 1 | 1 | 0 | 1 |
I realize this will mirror the same data (slug1 with slug2 is the same as slug2 with slug1), but I can't think of a better way. So far I have done a listagg:
def listagg(df, grouping_idx):
    return df.groupby(grouping_idx).agg(list)

new_df = listagg(df, 'user_id')
Returning:
page_view_page_slug
user_id
1 [slug1, slug2, slug3, slug4]
2 [slug5, slug3, slug2, slug1]
7 [slug6, slug4, slug7]
9 [slug3, slug5, slug1]
But I am struggling to think of a loop to count when items appear in a list together (regardless of order) and how to store the result. I also do not know how I would get this into a pivotable format.
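For what it's worth, the kind of loop I had in mind would look something like this, counting unordered pairs per user with itertools and collections.Counter on the sample data above, though I suspect pandas has a cleaner way:

```python
from collections import Counter
from itertools import combinations_with_replacement

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'page_view_page_slug': ['slug1', 'slug2', 'slug3', 'slug4',
                            'slug5', 'slug3', 'slug2', 'slug1'],
})

counts = Counter()
for _, slugs in df.groupby('user_id')['page_view_page_slug']:
    # count each unordered pair once per user, including a slug with itself
    for a, b in combinations_with_replacement(sorted(set(slugs)), 2):
        counts[(a, b)] += 1
        if a != b:
            counts[(b, a)] += 1  # mirror so the pivot comes out symmetric

pivot = pd.Series(counts).unstack(fill_value=0)
print(pivot)
```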
Upvotes: 7
Views: 510
Reputation: 153500
Let's use a self-join on user_id with merge and pd.crosstab to count:
import pandas as pd
from io import StringIO
txt = StringIO("""user_id page_view_page_slug
1 slug1
1 slug2
1 slug3
1 slug4
2 slug5
2 slug3
2 slug2
2 slug1""")
df = pd.read_csv(txt, sep=r'\s+')
dfm = df.merge(df, on='user_id')
df_out = pd.crosstab(dfm['page_view_page_slug_x'], dfm['page_view_page_slug_y'])
df_out
Output:
page_view_page_slug_y slug1 slug2 slug3 slug4 slug5
page_view_page_slug_x
slug1 2 2 2 1 1
slug2 2 2 2 1 1
slug3 2 2 2 1 1
slug4 1 1 1 1 0
slug5 1 1 1 0 1
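To see what the intermediate self-join produces: merging df with itself on user_id pairs every view a user made with every view by the same user, including a view with itself. A minimal sketch on two rows:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1],
                   'page_view_page_slug': ['slug1', 'slug2']})
# merge suffixes the duplicated column with _x and _y
dfm = df.merge(df, on='user_id')
print(dfm)  # 4 rows: slug1/slug1, slug1/slug2, slug2/slug1, slug2/slug2
```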
To handle repeated views of the same slug (the output below comes from data where slugs repeat, unlike the sample above), let's try:
dfi = df.assign(v_count=df.groupby(['user_id', 'page_view_page_slug']).cumcount())
# filter out spurious self-joins with query: keep pairs of different slugs,
# and same-slug pairs only when they are the same occurrence
dfi = dfi.merge(dfi, on='user_id')\
         .query('page_view_page_slug_x != page_view_page_slug_y or page_view_page_slug_x == page_view_page_slug_y and v_count_x == v_count_y')
df_out = pd.crosstab(dfi['page_view_page_slug_x'], dfi['page_view_page_slug_y'])
df_out
Output:
page_view_page_slug_y slug1 slug2 slug3 slug4 slug5
page_view_page_slug_x
slug1 3 3 3 2 1
slug2 3 2 2 1 1
slug3 3 2 2 1 1
slug4 2 1 1 1 0
slug5 1 1 1 0 1
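A minimal sketch of what the cumcount tag buys you, on hypothetical data where user 1 views slug1 twice: without the query, the self-join would pair the first slug1 view with the second and inflate the diagonal to 4; with it, each view only pairs with itself (the query here is a logically equivalent shortening of the one above):

```python
import pandas as pd

# hypothetical data: user 1 views slug1 twice in the same session
df = pd.DataFrame({'user_id': [1, 1, 1],
                   'page_view_page_slug': ['slug1', 'slug1', 'slug2']})

dfi = df.assign(v_count=df.groupby(['user_id', 'page_view_page_slug']).cumcount())
dfi = dfi.merge(dfi, on='user_id')\
         .query('page_view_page_slug_x != page_view_page_slug_y '
                'or v_count_x == v_count_y')
out = pd.crosstab(dfi['page_view_page_slug_x'], dfi['page_view_page_slug_y'])
print(out)  # slug1/slug1 is 2 (one per view), not 4
```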
Upvotes: 1
Reputation: 71687
Here is another way using numpy broadcasting: compare each value in user_id with every other value to build a boolean matrix, create a new dataframe from this matrix with index and columns set to page_view_page_slug, then take the sum on level=0 along axis=0 and axis=1 to count the user_ids at the cross-section of slugs:
a = df['user_id'].values
i = list(df['page_view_page_slug'])
# sum(level=...) was removed in pandas 2.0; group the duplicate labels instead
pd.DataFrame(a[:, None] == a, index=i, columns=i)\
    .groupby(level=0).sum()\
    .T.groupby(level=0).sum().T.astype(int)
slug1 slug2 slug3 slug4 slug5
slug1 2 2 2 1 1
slug2 2 2 2 1 1
slug3 2 2 2 1 1
slug4 1 1 1 1 0
slug5 1 1 1 0 1
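To see what the broadcast comparison builds: a[:, None] reshapes the 1-D user_id array into a column, and comparing it against the original row yields an n × n boolean matrix that is True exactly where two page views belong to the same user. A sketch on a tiny array:

```python
import numpy as np

a = np.array([1, 1, 2, 2])   # user_id of each page view
m = a[:, None] == a          # shape (4, 4), block-diagonal by user
print(m.astype(int))
# [[1 1 0 0]
#  [1 1 0 0]
#  [0 0 1 1]
#  [0 0 1 1]]
```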
Upvotes: 3
Reputation: 71687
Let's try groupby and reduce:
from functools import reduce
dfs = [pd.DataFrame(1, index=list(s), columns=list(s))
       for _, s in df.groupby('user_id')['page_view_page_slug']]
df_out = reduce(lambda x, y: x.add(y, fill_value=0), dfs).fillna(0).astype(int)
Details: group the dataframe on user_id, then for each user's group of page_view_page_slug values create an adjacency dataframe with index and columns set to the slugs in that group.
>>> dfs
[ slug1 slug2 slug3 slug4
slug1 1 1 1 1
slug2 1 1 1 1
slug3 1 1 1 1
slug4 1 1 1 1,
slug5 slug3 slug2 slug1
slug5 1 1 1 1
slug3 1 1 1 1
slug2 1 1 1 1
slug1 1 1 1 1]
Now reduce the above adjacency dataframes with DataFrame.add, passing fill_value=0 so slugs missing from one group count as zero; the result counts the user_ids at the cross-section of slugs.
>>> df_out
slug1 slug2 slug3 slug4 slug5
slug1 2 2 2 1 1
slug2 2 2 2 1 1
slug3 2 2 2 1 1
slug4 1 1 1 1 0
slug5 1 1 1 0 1
Optionally you can wrap the above code in a function as follows:
def count():
    df_out = pd.DataFrame()
    for _, s in df.groupby('user_id')['page_view_page_slug']:
        df_out = df_out.add(
            pd.DataFrame(1, index=list(s), columns=list(s)), fill_value=0)
    return df_out.fillna(0).astype(int)
>>> count()
slug1 slug2 slug3 slug4 slug5
slug1 2 2 2 1 1
slug2 2 2 2 1 1
slug3 2 2 2 1 1
slug4 1 1 1 1 0
slug5 1 1 1 0 1
Upvotes: 2