Dynamic Column Data In Pandas Dataframe (Populated By Dataframe)

Question

I am really stuck on how to approach adding columns to Pandas dynamically. I've been trying to search for an answer to work through this, however, I am afraid when searching I may also be using the wrong terminology to summarize what I am attempting to do.

I have a dataframe returned from a query that looks like the following:

department  action  date
marketing   close   09-01-2017
marketing   close   07-01-2018
marketing   close   06-01-2017
marketing   close   10-21-2019
marketing   open    08-01-2018
marketing   other   07-14-2018
sales       open    02-01-2019
sales       open    02-01-2017
sales       close   02-22-2019

The ultimate goal is I need a count of the types of actions grouped within particular date ranges.

My DESIRED output is something along the lines of:

department  01/01/2017-12/31/2017  01/01/2018-12/31/2018  01/01/2019-12/31/2019
            open close other       open close other       open  close other
marketing   0    2     0           1    1     1           0     1     0
sales       1    0     0           0    0     0           1     1     0

"Department" would be my index, then the contents would be filtered by date ranges specified in a list I provide, followed by the action taken (with counts). Being newer to this, I am confused as to what approach I should take - for example should I use Python (should I be looping or iterating), or should the heavy lifting be done in PANDAS. If in PANDAS, I am having difficulty determining what function to use (I've been looking at get_dummy() etc.).

I'd imagine this would be accomplished with either 1. Some type or FOR loop iterating through, 2. Adding a column to the dataframe based on the list then filtering the data underneath based on the value(s), or 3. using a function I am not aware of in Pandas

I have explained more of my thought process in this question, but I am not sure if the question is unclear which is why it may be unanswered.

Building a dataframe with dynamic date ranges using filtered results from another dataframe

maow · Accepted Answer

There are quite a few concepts you need at once here.

First you dont yet have the count. From your desired output I took you want it yearly but you can specify any time frame you want. Then just count with groupby() and count()

In [66]: df2 = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().squeeze().rename("count")                                                                                            
Out[66]: 
date  action  department
2017  close   marketing     2
      open    sales         1
2018  close   marketing     1
      open    marketing     1
      other   marketing     1
2019  close   marketing     1
              sales         1
      open    sales         1
Name: count, dtype: int64

The squeeze() and rename() are there because afterwards both the count column and the year would be called date and you get a name conflict. You could equivalently use rename(columns={'date': 'count'}) and not cast to a Series.

The second step is a pivot_table. This creates column names from values. Because there are combinations of date and action without a corresponding value, you need pivot_table.

In [62]: df2.pivot_table(index="department", columns=["date", "action"])                                                                                                                                    
Out[62]: 
           count                                 
date        2017       2018             2019     
action     close open close open other close open
department                                       
marketing    2.0  NaN   1.0  1.0   1.0   1.0  NaN
sales        NaN  1.0   NaN  NaN   NaN   1.0  1.0

Because NaN is internally representated as floating piont, your counts were also converted to floating point. To fix that, just append fillna and convert back to int.

In [65]: df2.reset_index().pivot_table(index="department", columns=["date", "action"]).fillna(0).astype(int)                                                                                                              
Out[65]: 
           count                                 
date        2017       2018             2019     
action     close open close open other close open
department                                       
marketing      2    0     1    1     1     1    0
sales          0    1     0    0     0     1    1

To get exactly you output you would need to modify pd.to_datetime(df.date).dt.year. You can do this with strftime (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html). Furthermore the column ["2017", "other"] was dropped because there was no value. If this creates problems you need to include the values beforehand. After the first step a reindex and a fillna should do the trick.

EDIT: Yes it does

In [77]: new_index = pd.MultiIndex.from_product([[2017, 2018, 2019], ["close", "open", "other"], ['marketing', 'sales']], names=['date', 'action', 'department'])
    ...:

In [78]: df3 = df2.reindex(new_index).fillna(0).astype(int).reset_index()
Out[78]: 
    date action department  count
0   2017  close  marketing      2
1   2017  close      sales      0
2   2017   open  marketing      0
3   2017   open      sales      1
4   2017  other  marketing      0
5   2017  other      sales      0
6   2018  close  marketing      1
..   ...    ...        ...    ...
11  2018  other      sales      0
12  2019  close  marketing      1
13  2019  close      sales      1
14  2019   open  marketing      0
15  2019   open      sales      1
16  2019  other  marketing      0
17  2019  other      sales      0


In [79]: df3.pivot_table(index="department", columns=["date", "action"])                                                                                                                                    
Out[79]: 
           count                                             
date        2017             2018             2019           
action     close open other close open other close open other
department                                                   
marketing      2    0     0     1    1     1     1    0     0
sales          0    1     0     0    0     0     1    1     0

Dynamic Column Data In Pandas Dataframe (Populated By Dataframe)

Answers (1)

Related Questions