Using pivot table returns only indexed columns, omits pivoted columns

Question

I'm trying to pivot my measure column to have its values become fields.

Meaning net_revenue and vic should become their own fields.

In the picture below, the Input is on the left, and Desired Output is on the right:

I'm aware that measure has duplicate keys (e.g., net_revenue appears more than once), but date_budget, which I'm also indexing on, is different for that block of data. date_budget does repeat, but only when measure has changed, so we never have truly duplicate rows for the indexed columns.

Question: In Pentaho CPython script, when I look at output from the script, I get back only my indexed columns, but not the pivoted columns net_revenue and vic. Why is this?

Script:

import pandas as pd

budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'], errors='coerce')

# Perform the pivot.
budget = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure'
    )

budget.reset_index(inplace=True)

result_df = budget

Sample Dataframe:

d = {
    'country': ['us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us'],
    'customer': ['customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2',],
    'measure': ['net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic', 'net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018'],
    'monthly_budget_phasing': ['$55', '$23', '$42', '$29', '$35', '$98', '$87', '$77', '$34', '$90', '$75', '$12']
    }
df = pd.DataFrame(data=d)

Worked in Pandas with aggfunc='first', but doesn't work in Pentaho. Pentaho is still outputting only country, customer, measure.

Pandas output from terminal:

   country   customer date_budget      measure monthly_budget_phasing
0       us  customer1    1/1/2018  net_revenue                    $55
1       us  customer1    2/1/2018  net_revenue                    $23
2       us  customer1    3/1/2018  net_revenue                    $42
3       us  customer1    1/1/2018          vic                    $29
4       us  customer1    2/1/2018          vic                    $35
5       us  customer1    3/1/2018          vic                    $98
6       us  customer2    1/1/2018  net_revenue                    $87
7       us  customer2    2/1/2018  net_revenue                    $77
8       us  customer2    3/1/2018  net_revenue                    $34
9       us  customer2    1/1/2018          vic                    $90
10      us  customer2    2/1/2018          vic                    $75
11      us  customer2    3/1/2018          vic                    $12
measure country   customer date_budget net_revenue  vic
0            us  customer1    1/1/2018         $55  $29
1            us  customer1    2/1/2018         $23  $35
2            us  customer1    3/1/2018         $42  $98
3            us  customer2    1/1/2018         $87  $90
4            us  customer2    2/1/2018         $77  $75
5            us  customer2    3/1/2018         $34  $12

Even though above Python works, Pentaho 8.0 CPython plugin is still causing issues.

First I melt dates:

Then I unmelt measures:

Where are my net_revenue and vic fields?

jezrael · Accepted Answer

It seems you need add replace:

budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'].replace('\$','', regex=True), errors='coerce')
#alternative
#budget['monthly_budget_phasing'] = budget['monthly_budget_phasing'].replace('\$','', regex=True).astype(int)


df = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure',
    aggfunc='first'

    ).reset_index()

Alternative:

cols = ['country', 'customer', 'date_budget', 'measure']
#if duplicates, first remove it
df = budget.drop_duplicates(cols)
#pivot by unstack
df = df.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()

print (df)
measure country   customer date_budget  net_revenue  vic
0            us  customer1    1/1/2018           55   29
1            us  customer1    2/1/2018           23   35
2            us  customer1    3/1/2018           42   98
3            us  customer2    1/1/2018           87   90
4            us  customer2    2/1/2018           77   75
5            us  customer2    3/1/2018           34   12

Using pivot table returns only indexed columns, omits pivoted columns

Answers (2)

Related Questions