Reputation: 4498
I have the following python pandas dataframe:
| Number of visits per year |
user id | 2013 | 2014 | 2015 | 2016 |
A 4 3 6 0
B 3 0 7 3
C 10 6 3 0
I want to calculate the percentage of users who returned based on their numbers of visits. I am sorry , I don't have any code yet, I wasn't sure how to start this.
This is the end result I am looking for:
| Number of visits in the year |
Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
2014 7% 3% 4% 15% 6% 7% 18% 17% 3% 2%
2015 3% ....
2016
So based on the above I can say that 15% of clients who visited the store 4 times in 2013, came back to the store in 2014.
Thank you very much.
UPDATE: This is what I did, maybe there is a better way through a loop?
For each year, I had a csv like this:
user_id | NR_V
A 4
B 3
C 10
NR_V stands for number of visits.
So I uploaded each csv as it's own df and I had df_2009, df_2010, ... until df_2016.
For each file I added a column column with 0/1 if they shopped the next year.
df_2009['shopped2010'] = np.where(df_2009['user_ID'].isin(df_2010['user_ID']), 1, 0)
Then I pivoted each dataframe.
pivot_2009 = pd.pivot_table(df_2009,index=["NR_V"],aggfunc={"NR_V":len, "shopped2010":np.sum})
Next, for each dataframe I created a new dataframe with the a column calculating the percentage by number of visits.
p_2009 = pd.DataFrame()
p_2009['%returned2010'] = (pivot_2009['shopped2010']/pivot_2009['NR_V'])*100
Finally, I merged all those dataframes into one.
dfs = [p_2009, p_2010, p_2011, p_2012, p_2013, p_2014, p_2015 ]
final = pd.concat(dfs, axis=1)
Upvotes: 1
Views: 2374
Reputation: 71
I used the index value of every visitor and checked if the same index value (aka the same vistor_ID) was more then 0 the next year. This was then added to a dictionary in the form of True or False, which you could use for a bar-chart. I also made two lists (times_returned and returned_at_all) for additional data manipulation.
import pandas as pd
# Part 1, Building the dataframe.
df = pd.DataFrame({
'Visitor_ID':[1,2,3],
'2010' :[4,3,10],
'2011' :[3,0,6],
'2012' :[6,7,3],
'2013' :[0,3,0]
})
df.set_index("Visitor_ID", inplace=True)
# Part 2, preparing the required variables.
def dictionary (max_visitors):
dictionary={}
for x in range(max_visitors):
dictionary["number_{}".format(x)] = []
# print(dictionary)
return dictionary
# Part 3, Figuring out if the customer returned.
def compare_yearly_visits(current_year, next_year):
index = 1
years = df.columns
for x in df[current_year]:
# print (df[years][current_year][index], 'this year.')
# print (df[years][next_year][index], 'Next year.')
how_many_visits = df[years][current_year][index]
did_he_return = df[years][next_year][index]
if did_he_return > 0:
# If the visitor returned, add to a bunch of formats:
returned_at_all.append([how_many_visits, True])
times_returned.append([how_many_visits, did_he_return])
dictionary["number_{}".format(x)].append(True)
else:
## If the visitor did not return, add to a bunch of formats:
returned_at_all.append([how_many_visits, False])
dictionary["number_{}".format(x)].append(False)
index = index +1
# Part 4, The actual program:
highest_amount_of_visits = 11 # should be done automatically, max(visits)?
relevant_years = len(df.columns) -1
times_returned = []
returned_at_all = []
dictionary = dictionary(highest_amount_of_visits)
for column in range(relevant_years):
# print (dictionary)
this_year = df.columns[column]
next_year = df.columns[column+1]
compare_yearly_visits(this_year, next_year)
print ("cumulative dictionary up to:", this_year,"\n", dictionary)
Upvotes: 0
Reputation: 294488
Consider the sample visits dataframe df
df = pd.DataFrame(
np.random.randint(1, 10, (100, 5)),
pd.Index(['user_{}'.format(i) for i in range(1, 101)], name='user id'),
[
['Number of visits per year'] * 5,
[2012, 2013, 2014, 2015, 2016]
]
)
df.head()
You can apply pd.value_counts
with parameter normalize=True
.
Also, since an entry of 8
represents 8 separate visits, it should count 8 times. I'll use repeat
to accomplish this prior to value_counts
def count_visits(col):
v = col.values
return pd.value_counts(v.repeat(v), normalize=True)
df.apply(count_visits).stack().unstack(0)
Upvotes: 2
Reputation: 5582
Please find below my solution. As a note, I am pretty positive that this can be improved.
# step 0: create data frame
df = pd.DataFrame({'2013':[4, 3, 10], '2014':[3, 0, 6], '2015':[6, 7, 3], '2016':[0, 3, 0]}, index=['A', 'B', 'C'])
# container list of dataframes to be concatenated
frames = []
# iterate through the dataframe one column at a time and determine its value_counts(freq table)
for name, series in df.iteritems():
frames.append(series.value_counts())
# Merge frequency table for all columns into a dataframe
temp_df = pd.concat(frames, axis=1).transpose().fillna(0)
# Find the key for the new dataframe (i.e. range for number of columns), and append missing ones
cols = temp_df.columns
min = cols.min()
max = cols.max()
for i in range(min, max):
if (not i in a):
temp_df[str(i)] = 0
# Calculate percentage
final_df = temp_df.div(temp_df.sum(axis=1), axis=0)
Upvotes: -1