How to convert data frame for time series analysis in Python?

Question

I have a dataset of around 13,000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform time series analysis where the x axis is the date (each day) and the y axis is the frequency of text on the corresponding date.

I think that creating a new data frame with the unique dates and the number of texts on each date would solve my problem.

Sample data

How can I create a new column with frequency of text each day? For example:

Thanks in Advance!

Bruce Swain · Accepted Answer

Depending on the task you are trying to solve, I can see two options for this dataset.

Either, as you show in your example, count the number of rows for each day, regardless of the value of the text field.
Or count the number of occurrences of each unique value of the text field for each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.

First, set up the data and make sure the Date column has a datetime type:

import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date']) #convert to datetime type if not already done

         Date Text
0  2018-01-01    A
1  2018-01-01    B
2  2018-01-01    C
3  2018-01-02    A
4  2018-01-03    A

Then, for option one:

daily_counts = df.groupby('Date').count()  # non-null Text entries per day; df itself is left untouched

            Text
Date            
2018-01-01     3
2018-01-02     1
2018-01-03     1
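
One thing to watch for with option one: a day on which nothing was recorded does not appear in the result at all, which can matter if the series is meant to have one point per day. The following is a minimal sketch of one way to fill such gaps with zeros. The toy data and the names daily and full_range are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Toy data with a gap: nothing was recorded on 2018-01-03 (made-up values).
df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-04'],
    'Text': ['A', 'B', 'A', 'C'],
})
df['Date'] = pd.to_datetime(df['Date'])

# Count the non-null Text entries per day.
daily = df.groupby('Date')['Text'].count()

# Build a complete daily calendar between the first and last observed date,
# then reindex onto it so that missing days show up with a count of 0.
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq='D')
daily = daily.reindex(full_range, fill_value=0)

print(daily)
```

The result is a Series indexed by date with one entry per calendar day, which can then be plotted or passed on to whatever time series tooling you use.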

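For option two:

# Build the indicator columns separately so each one keeps its own label.
# Assigning them to df[df['Text'].unique()] would mislabel the columns
# whenever the values do not first appear in sorted order.
dummies = pd.get_dummies(df['Text'])
df = dummies.groupby(df['Date']).sum()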

            A  B  C
Date               
2018-01-01  1  1  1
2018-01-02  1  0  0
2018-01-03  1  0  0

The get_dummies function creates one column per distinct value of the Text field. Each column is a boolean indicator that tells us, for each row of the dataframe, whether that value occurred in that row. Summing these indicators in a groupby on the Date field then gives the number of occurrences of each value per day.
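
As a cross-check on the indicator-and-sum approach, the sketch below computes the same per-day table in two ways: with get_dummies followed by a grouped sum, and with pd.crosstab, which tabulates Date against Text directly. The data is the toy example with the Text values deliberately out of alphabetical order, and the variable names via_dummies and via_crosstab are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03'],
    'Text': ['C', 'B', 'A', 'A', 'A'],  # deliberately not in alphabetical order
})
df['Date'] = pd.to_datetime(df['Date'])

# Route 1: one indicator column per Text value, summed per day.
via_dummies = pd.get_dummies(df['Text']).groupby(df['Date']).sum()

# Route 2: a contingency table of Date against Text.
via_crosstab = pd.crosstab(df['Date'], df['Text'])

print(via_dummies)
print(via_crosstab)
```

Both routes should give one row per date and one column per Text value, with the columns labelled by the value they count.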

If you are not familiar with groupby and aggregation operations, I recommend that you read this guide first.
