Reputation: 311
I am really stuck on how to approach adding columns to Pandas dynamically. I've been trying to search for an answer to work through this, however, I am afraid when searching I may also be using the wrong terminology to summarize what I am attempting to do.
I have a dataframe returned from a query that looks like the following:
department action date
marketing close 09-01-2017
marketing close 07-01-2018
marketing close 06-01-2017
marketing close 10-21-2019
marketing open 08-01-2018
marketing other 07-14-2018
sales open 02-01-2019
sales open 02-01-2017
sales close 02-22-2019
The ultimate goal is I need a count of the types of actions grouped within particular date ranges.
My DESIRED output is something along the lines of:
department 01/01/2017-12/31/2017 01/01/2018-12/31/2018 01/01/2019-12/31/2019
open close other open close other open close other
marketing 0 2 0 1 1 1 0 1 0
sales 1 0 0 0 0 0 1 1 0
"Department" would be my index, then the contents would be filtered by date ranges specified in a list I provide, followed by the action taken (with counts). Being newer to this, I am confused as to what approach I should take - for example should I use Python (should I be looping or iterating), or should the heavy lifting be done in PANDAS. If in PANDAS, I am having difficulty determining what function to use (I've been looking at get_dummy() etc.).
I'd imagine this would be accomplished with either 1. Some type or FOR loop iterating through, 2. Adding a column to the dataframe based on the list then filtering the data underneath based on the value(s), or 3. using a function I am not aware of in Pandas
I have explained more of my thought process in this question, but I am not sure if the question is unclear which is why it may be unanswered.
Building a dataframe with dynamic date ranges using filtered results from another dataframe
Upvotes: 0
Views: 547
Reputation: 2887
There are quite a few concepts you need at once here.
First you dont yet have the count. From your desired output I took you want it yearly but you can specify any time frame you want. Then just count with groupby()
and count()
In [66]: df2 = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().squeeze().rename("count")
Out[66]:
date action department
2017 close marketing 2
open sales 1
2018 close marketing 1
open marketing 1
other marketing 1
2019 close marketing 1
sales 1
open sales 1
Name: count, dtype: int64
The squeeze()
and rename()
are there because afterwards both the count column and the year would be called date
and you get a name conflict. You could equivalently use rename(columns={'date': 'count'})
and not cast to a Series
.
The second step is a pivot_table
. This creates column names from values. Because there are combinations of date
and action
without a corresponding value, you need pivot_table
.
In [62]: df2.pivot_table(index="department", columns=["date", "action"])
Out[62]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2.0 NaN 1.0 1.0 1.0 1.0 NaN
sales NaN 1.0 NaN NaN NaN 1.0 1.0
Because NaN
is internally representated as floating piont, your counts were also converted to floating point. To fix that, just append fillna
and convert back to int
.
In [65]: df2.reset_index().pivot_table(index="department", columns=["date", "action"]).fillna(0).astype(int)
Out[65]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2 0 1 1 1 1 0
sales 0 1 0 0 0 1 1
To get exactly you output you would need to modify pd.to_datetime(df.date).dt.year
. You can do this with strftime
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html). Furthermore the column ["2017", "other"] was dropped because there was no value. If this creates problems you need to include the values beforehand. After the first step a reindex
and a fillna
should do the trick.
EDIT: Yes it does
In [77]: new_index = pd.MultiIndex.from_product([[2017, 2018, 2019], ["close", "open", "other"], ['marketing', 'sales']], names=['date', 'action', 'department'])
...:
In [78]: df3 = df2.reindex(new_index).fillna(0).astype(int).reset_index()
Out[78]:
date action department count
0 2017 close marketing 2
1 2017 close sales 0
2 2017 open marketing 0
3 2017 open sales 1
4 2017 other marketing 0
5 2017 other sales 0
6 2018 close marketing 1
.. ... ... ... ...
11 2018 other sales 0
12 2019 close marketing 1
13 2019 close sales 1
14 2019 open marketing 0
15 2019 open sales 1
16 2019 other marketing 0
17 2019 other sales 0
In [79]: df3.pivot_table(index="department", columns=["date", "action"])
Out[79]:
count
date 2017 2018 2019
action close open other close open other close open other
department
marketing 2 0 0 1 1 1 1 0 0
sales 0 1 0 0 0 0 1 1 0
Upvotes: 1