Yumi

Reputation: 241

Save pandas DataFrame values in a dictionary given criteria

I have a pandas DataFrame that contains the numbers of visitors and the paths they took before completing the conversion goal. Each row represents one path and the number of visitors who took it. For example, in row 1, 18 visitors went through '(entrance)' --> '/' --> '/ContactUs/Default.aspx' before reaching the target goal.

I'm only interested in the last product page a visitor was on, and I'm trying to create a dictionary that takes the product name, such as 'VFB25AEH', as the key and the number of visits as the value.

Step1                    Step2                    Step3                  Visits
/ContactUs/Default.aspx  /                        (entrance)             18
/Products/GBR100L.aspx   /Products/VFB25AEH.aspx  /Products/RAD80L.aspx  9
/Products/VFB25AEH.aspx  (entrance)               (not set)              5
/Products/RAD80L.aspx    (entrance)               (not set)              4

The following is my code, which loops through each column of each row, takes the first product page (the first step that contains '/Products/'), and saves the total number of visits in a dictionary:

result = {}
for i, row in enumerate(df.values):
    for c in row:
        if 'products' in str(c).lower():
            c = c.strip('.aspx').split('/')[2]
            if c in result:
                result[c]+= 1
            result[c] = 1

The ideal result is result['VFB25AEH'] = 5, result['RAD80L'] = 4, result['GBR100L'] = 9.

However, it turns out that all the values in result are 1. Can someone help point out the error here?

Upvotes: 0

Views: 1479

Answers (1)

Samuel Littley

Reputation: 734

The last 3 lines of your code reset result[c] back to 1 every iteration: result[c] = 1 is not inside an else branch, so it runs even right after the increment. Instead you need:

if c in result:
    result[c] += 1
else:
    result[c] = 1
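
To see the difference in isolation, here is a minimal sketch that runs both versions of the counting logic over a short, hypothetical list of product names (the list `pages` and the names `buggy` and `fixed` are made up for illustration and are not from the question's data):

```python
pages = ["VFB25AEH", "RAD80L", "VFB25AEH"]  # hypothetical sample keys

# Original logic: the assignment runs unconditionally, undoing the increment.
buggy = {}
for p in pages:
    if p in buggy:
        buggy[p] += 1
    buggy[p] = 1

# Corrected logic: the assignment runs only when the key is seen for the first time.
fixed = {}
for p in pages:
    if p in fixed:
        fixed[p] += 1
    else:
        fixed[p] = 1

print(buggy)
print(fixed)
```

In the first loop every value ends up as 1, while the second counts 'VFB25AEH' twice.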

You could alternatively use collections.defaultdict, which creates a missing key with a default value of 0 the first time it is accessed:

import collections

result = collections.defaultdict(int)
for i, row in enumerate(df.values):
    for c in row:
        if 'products' in str(c).lower():
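            # Take the file name, then drop its extension; str.strip('.aspx')
            # strips characters, not the suffix, and can eat into the name
            c = c.split('/')[2].rsplit('.', 1)[0]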
            result[c] += 1
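
A side note on the `.strip('.aspx')` call in the question's code: `str.strip` treats its argument as a set of characters to remove from both ends, not as a suffix. It happens to work for the product names shown because they end in an uppercase letter, but a name ending in 'a', 's', 'p' or 'x' would be cut short. The sketch below shows the effect with a hypothetical page, '/Products/atlas.aspx', and one possible alternative; the helper name `product_name` is an assumption, not part of the original code:

```python
def product_name(path):
    """Return the file name of a '/Products/<name>.aspx' path without its extension."""
    filename = path.split('/')[2]      # e.g. 'atlas.aspx'
    return filename.rsplit('.', 1)[0]  # drop everything after the last dot


hypothetical_path = '/Products/atlas.aspx'

# strip() keeps removing trailing characters found in '.aspx', including the
# final 'a' and 's' of the name itself
print(hypothetical_path.strip('.aspx').split('/')[2])

# Splitting on the last dot leaves the name intact
print(product_name(hypothetical_path))
```

The first call prints 'atl', the second prints 'atlas'.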

EDIT

Taking into account the requirement to sum up the number of visits, and take only the most recent product page visited:

import collections

result = collections.defaultdict(int)
for row in df.values:
    for c in row:
        if 'products' in str(c).lower():
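            # Take the file name, then drop its extension; str.strip('.aspx')
            # strips characters, not the suffix, and can eat into the name
            c = c.split('/')[2].rsplit('.', 1)[0]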

            # The number of visits is in the last entry in the row
            result[c] += row[-1]

            # We've found the most recent product page, so move on to the next row
            break

You don't actually need the call to enumerate(), since you weren't using the index at all.
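
For anyone who wants to check this against the sample data, here is a self-contained sketch that rebuilds the four rows from the question and applies the same loop. The function name `last_product_visits` is hypothetical, and the sketch assumes that Visits is the last column and that Step1 holds the most recent page, as in the question's table. It also uses rsplit rather than strip to drop the extension, which is a substitution on my part:

```python
import collections

import pandas as pd

# Sample rows copied from the table in the question
df = pd.DataFrame({
    "Step1": ["/ContactUs/Default.aspx", "/Products/GBR100L.aspx",
              "/Products/VFB25AEH.aspx", "/Products/RAD80L.aspx"],
    "Step2": ["/", "/Products/VFB25AEH.aspx", "(entrance)", "(entrance)"],
    "Step3": ["(entrance)", "/Products/RAD80L.aspx", "(not set)", "(not set)"],
    "Visits": [18, 9, 5, 4],
})


def last_product_visits(frame):
    """Sum visits per product, using the first product page found in each row."""
    result = collections.defaultdict(int)
    for row in frame.values:
        for c in row:
            if 'products' in str(c).lower():
                # File name without its extension
                name = c.split('/')[2].rsplit('.', 1)[0]
                # Visits is assumed to be the last column; int() keeps plain Python ints
                result[name] += int(row[-1])
                # Step1 comes first, so the first match is the most recent product page
                break
    return dict(result)


print(last_product_visits(df))
```

For the sample rows this gives 9 for GBR100L, 5 for VFB25AEH and 4 for RAD80L, which matches the ideal result in the question. The /ContactUs row contributes nothing because none of its steps is a product page.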

Upvotes: 1
