user3871
user3871

Reputation: 12718

Using pivot table returns only indexed columns, omits pivoted columns

I'm trying to pivot my measure column to have its values become fields.

Meaning net_revenue and vic should become their own fields.

In the picture below, the Input is on the left, and Desired Output is on the right:

enter image description here

I'm aware that measure has duplicate keys (e.g., net_revenue appears more than once), but date_budget, which I'm also indexing on, is different for that block of data. date_budget does repeat, but only when measure has changed, so we never have truly duplicate rows for the indexed columns.

Question: In Pentaho CPython script, when I look at output from the script, I get back only my indexed columns, but not the pivoted columns net_revenue and vic. Why is this?

enter image description here

Script:

import pandas as pd

budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'], errors='coerce')

# Perform the pivot.
budget = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure'
    )

budget.reset_index(inplace=True)

result_df = budget  

Sample Dataframe:

d = {
    'country': ['us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us'],
    'customer': ['customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2',],
    'measure': ['net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic', 'net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018'],
    'monthly_budget_phasing': ['$55', '$23', '$42', '$29', '$35', '$98', '$87', '$77', '$34', '$90', '$75', '$12']
    }
df = pd.DataFrame(data=d)

Worked in Pandas with aggfunc='first', but doesn't work in Pentaho. Pentaho is still outputting only country, customer, measure.

Pandas output from terminal:

   country   customer date_budget      measure monthly_budget_phasing
0       us  customer1    1/1/2018  net_revenue                    $55
1       us  customer1    2/1/2018  net_revenue                    $23
2       us  customer1    3/1/2018  net_revenue                    $42
3       us  customer1    1/1/2018          vic                    $29
4       us  customer1    2/1/2018          vic                    $35
5       us  customer1    3/1/2018          vic                    $98
6       us  customer2    1/1/2018  net_revenue                    $87
7       us  customer2    2/1/2018  net_revenue                    $77
8       us  customer2    3/1/2018  net_revenue                    $34
9       us  customer2    1/1/2018          vic                    $90
10      us  customer2    2/1/2018          vic                    $75
11      us  customer2    3/1/2018          vic                    $12
measure country   customer date_budget net_revenue  vic
0            us  customer1    1/1/2018         $55  $29
1            us  customer1    2/1/2018         $23  $35
2            us  customer1    3/1/2018         $42  $98
3            us  customer2    1/1/2018         $87  $90
4            us  customer2    2/1/2018         $77  $75
5            us  customer2    3/1/2018         $34  $12

Even though above Python works, Pentaho 8.0 CPython plugin is still causing issues.

First I melt dates:

enter image description here

enter image description here

Then I unmelt measures:

enter image description here

enter image description here

Where are my net_revenue and vic fields?

enter image description here

Upvotes: 1

Views: 432

Answers (2)

Andrei Luksha
Andrei Luksha

Reputation: 1030

Kettle needs to know columns, which every step produces before the transformation runs - that's why I don't think it can be done in Python (select * queries are sort of exception, but they too secretly get metadata before the transformation runs). The usual way to perform pivot in Kettle is to use Row denormalizer step. The step requires you to specify column names for unpivoted values, but if you can't hard-code the values, it is possible to pass them based on your data via ETL Metadata Injection step.

In order to pass values dynamically create 2 transformations: the sub-transformation will get input data from the parent transformation and perform pivot operation via row denormalizer. The parent transformation will read input data, get unique values, which will become column names, then pass these values to ETL Metadata Injection step. The injection step will fill the row denormalizer metadata with the column names and execute the transformation, feeding your input data.

Upvotes: 0

jezrael
jezrael

Reputation: 862741

It seems you need add replace:

budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'].replace('\$','', regex=True), errors='coerce')
#alternative
#budget['monthly_budget_phasing'] = budget['monthly_budget_phasing'].replace('\$','', regex=True).astype(int)


df = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure',
    aggfunc='first'

    ).reset_index()

Alternative:

cols = ['country', 'customer', 'date_budget', 'measure']
#if duplicates, first remove it
df = budget.drop_duplicates(cols)
#pivot by unstack
df = df.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()

print (df)
measure country   customer date_budget  net_revenue  vic
0            us  customer1    1/1/2018           55   29
1            us  customer1    2/1/2018           23   35
2            us  customer1    3/1/2018           42   98
3            us  customer2    1/1/2018           87   90
4            us  customer2    2/1/2018           77   75
5            us  customer2    3/1/2018           34   12

Upvotes: 1

Related Questions