Reputation: 12718
I'm trying to pivot my measure
column to have its values become fields.
Meaning net_revenue
and vic
should become their own fields.
In the picture below, the Input is on the left, and Desired Output is on the right:
I'm aware that measure
has duplicate keys (e.g., net_revenue
appears more than once), but date_budget
, which I'm also indexing on, is different for that block of data. date_budget
does repeat, but only when measure
has changed, so we never have truly duplicate rows for the indexed columns.
Question: In Pentaho CPython script, when I look at output from the script, I get back only my indexed columns, but not the pivoted columns net_revenue
and vic
. Why is this?
Script:
import pandas as pd
budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'], errors='coerce')
# Perform the pivot.
budget = pd.pivot_table(budget,
values='monthly_budget_phasing',
index=['country', 'customer', 'date_budget'],
columns='measure'
)
budget.reset_index(inplace=True)
result_df = budget
Sample Dataframe:
d = {
'country': ['us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us'],
'customer': ['customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2',],
'measure': ['net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic', 'net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic'],
'date_budget': ['1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018'],
'monthly_budget_phasing': ['$55', '$23', '$42', '$29', '$35', '$98', '$87', '$77', '$34', '$90', '$75', '$12']
}
df = pd.DataFrame(data=d)
Worked in Pandas with aggfunc='first'
, but doesn't work in Pentaho. Pentaho is still outputting only country
, customer
, measure
.
Pandas output from terminal:
country customer date_budget measure monthly_budget_phasing
0 us customer1 1/1/2018 net_revenue $55
1 us customer1 2/1/2018 net_revenue $23
2 us customer1 3/1/2018 net_revenue $42
3 us customer1 1/1/2018 vic $29
4 us customer1 2/1/2018 vic $35
5 us customer1 3/1/2018 vic $98
6 us customer2 1/1/2018 net_revenue $87
7 us customer2 2/1/2018 net_revenue $77
8 us customer2 3/1/2018 net_revenue $34
9 us customer2 1/1/2018 vic $90
10 us customer2 2/1/2018 vic $75
11 us customer2 3/1/2018 vic $12
measure country customer date_budget net_revenue vic
0 us customer1 1/1/2018 $55 $29
1 us customer1 2/1/2018 $23 $35
2 us customer1 3/1/2018 $42 $98
3 us customer2 1/1/2018 $87 $90
4 us customer2 2/1/2018 $77 $75
5 us customer2 3/1/2018 $34 $12
Even though above Python works, Pentaho 8.0 CPython plugin is still causing issues.
First I melt dates:
Then I unmelt measures:
Where are my net_revenue and vic fields?
Upvotes: 1
Views: 432
Reputation: 1030
Kettle needs to know columns, which every step produces before the transformation runs - that's why I don't think it can be done in Python (select * queries are sort of exception, but they too secretly get metadata before the transformation runs). The usual way to perform pivot in Kettle is to use Row denormalizer
step. The step requires you to specify column names for unpivoted values, but if you can't hard-code the values, it is possible to pass them based on your data via ETL Metadata Injection
step.
In order to pass values dynamically create 2 transformations: the sub-transformation will get input data from the parent transformation and perform pivot operation via row denormalizer. The parent transformation will read input data, get unique values, which will become column names, then pass these values to ETL Metadata Injection step. The injection step will fill the row denormalizer metadata with the column names and execute the transformation, feeding your input data.
Upvotes: 0
Reputation: 862741
It seems you need add replace
:
budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'].replace('\$','', regex=True), errors='coerce')
#alternative
#budget['monthly_budget_phasing'] = budget['monthly_budget_phasing'].replace('\$','', regex=True).astype(int)
df = pd.pivot_table(budget,
values='monthly_budget_phasing',
index=['country', 'customer', 'date_budget'],
columns='measure',
aggfunc='first'
).reset_index()
Alternative:
cols = ['country', 'customer', 'date_budget', 'measure']
#if duplicates, first remove it
df = budget.drop_duplicates(cols)
#pivot by unstack
df = df.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()
print (df)
measure country customer date_budget net_revenue vic
0 us customer1 1/1/2018 55 29
1 us customer1 2/1/2018 23 35
2 us customer1 3/1/2018 42 98
3 us customer2 1/1/2018 87 90
4 us customer2 2/1/2018 77 75
5 us customer2 3/1/2018 34 12
Upvotes: 1