Reputation: 472
I have two different CSVs with time series data. One table is continuous, starting on 2017-01-01 at 00:00; from there, each row represents one hour (Table 1). The data looks like this:
Date, Volume
2017-02-03 12-PM, 9787.51
2017-02-03 01-PM, 9792.01
2017-02-03 02-PM, 9803.94
2017-02-03 03-PM, 9573.99
The other table contains events that are timestamped as UNIX time in seconds. I was able to convert them to datetimes and group them by hour with this code:
import pandas as pd

# convert the UNIX timestamp (seconds since the epoch) to a datetime
df['datetime'] = pd.to_datetime(df['created_utc'], unit='s')
# truncate each timestamp to the start of its hour
df['date_by_hour'] = df['datetime'].dt.floor('h')
This resulted in the following data:
created_utc, score, compound, datetime, date_by_hour
1486120391, 156, 0.125, 2017-02-03 12:13:11, 2017-02-03 12:00:00
1486125540, 1863, 0.475, 2017-02-03 13:39:00, 2017-02-03 13:00:00
1486126013, 863, 0.889, 2017-02-03 13:46:53, 2017-02-03 13:00:00
1486130203, 23, 0.295, 2017-02-03 14:56:43, 2017-02-03 14:00:00
Now I need to map the events (Table 2) onto the time series of Table 1. If multiple events happened within one hour, I need to sum their scores and take the mean of their compound values (e.g. for the 01-PM row below: 1863 + 863 = 2726 and (0.475 + 0.889) / 2 = 0.682). In the end I want a dataframe like this:
Date, Volume, score, compound,
2017-02-03 12-PM, 9787.51, 156, 0.125,
2017-02-03 01-PM, 9792.01, 2726, 0.682,
2017-02-03 02-PM, 9803.94, 23, 0.295,
2017-02-03 03-PM, 9573.99, 0, 0,
I know my code below does not work, but I wanted to show how I was thinking of achieving this. I thought I could loop through each row of my events table df2 and check whether the datetime matches. If so, I would calculate score and compound. The issue is that I know one should not loop through a dataframe, and I don't know how to iterate over another dataframe at the same time and perform the right calculations based on the previous rows...
for index, row in df2.iterrows():
    memory_score = 0
    memory_compound = 0
    if df1['Date'] == df2['date_by_hour']:
        df1['score'] = row['score'] + memory_score
        df1['compound'] = (row['compound'] + memory_compound) / 2
How can I get to my final dataframe? There must be some pandas magic I could use to map the time series data to the right hours.
Thanks in advance and best regards
EDIT: There can be any number of events within one hour. I just chose 2 for this simple example, but it could be 5000, or 0 in some cases.
Upvotes: 1
Views: 1237
Reputation: 7852
# import necessary packages, set seed
import pandas as pd
import datetime
import random
random.seed(42)
Let's create an example df1, using this:
num_hours = 5  # five hourly timestamps (the list steps in hours, not days)
base = datetime.datetime(2017, 2, 3, 12)
date_list = [base + datetime.timedelta(hours=x) for x in range(num_hours)]
Then, using:
df1 = pd.DataFrame.from_dict({'Date': date_list,
                              'Volume': [random.randint(9000, 11000) for _ in range(len(date_list))]})
This gives us:
+----+---------------------+----------+
| | Date | Volume |
|----+---------------------+----------|
| 0 | 2017-02-03 12:00:00 | 9228 |
| 1 | 2017-02-03 13:00:00 | 9051 |
| 2 | 2017-02-03 14:00:00 | 10518 |
| 3 | 2017-02-03 15:00:00 | 9563 |
| 4 | 2017-02-03 16:00:00 | 9501 |
+----+---------------------+----------+
Let's create df2 as well:
# timestamps with random jitter of 0 to 2 hours per step
random_date_list = [base + datetime.timedelta(hours=x * random.uniform(0, 2)) for x in range(7)]
df2 = pd.DataFrame({'datetime': random_date_list,
                    'score': [random.randint(20, 200) for _ in range(len(random_date_list))],
                    'compound': [random.uniform(0, 1) for _ in range(len(random_date_list))]})
which gives:
+----+----------------------------+---------+------------+
| | datetime | score | compound |
|----+----------------------------+---------+------------|
| 0 | 2017-02-03 12:00:00 | 75 | 0.71602 |
| 1 | 2017-02-03 13:28:22.592742 | 79 | 0.701325 |
| 2 | 2017-02-03 14:42:24.472619 | 149 | 0.41952 |
| 3 | 2017-02-03 17:21:11.078662 | 174 | 0.449209 |
| 4 | 2017-02-03 12:41:43.838380 | 26 | 0.278191 |
| 5 | 2017-02-03 16:13:09.185509 | 163 | 0.8693 |
| 6 | 2017-02-03 12:21:27.239880 | 70 | 0.758807 |
+----+----------------------------+---------+------------+
Let's create a column in df2 with the datetimes truncated to hour resolution:
df2['Date'] = df2['datetime'].apply(lambda x: x.replace(minute=0, second=0, microsecond=0))
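As an aside, pandas can do this truncation directly; a one-line sketch that should be equivalent here:
df2['Date'] = df2['datetime'].dt.floor('h')  # floor each timestamp to the start of its hour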
We can merge df1 & df2, and replace NaNs with 0s:
# outer merge keeps df1's empty hours and any events outside df1's range
merged = pd.merge(df1, df2, on='Date', how='outer')
# hours without events get score 0 and compound 0, matching the desired output
merged.fillna(0, inplace=True)
Now calculate the required new columns:
newscoredf = merged.groupby('Date')[['score']].agg('sum')        # sum of scores per hour
newcompounddf = merged.groupby('Date')[['compound']].agg('mean') # mean compound per hour
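If you prefer, both aggregations can also be done in a single groupby pass using named aggregation; a minimal sketch (newcols is just an illustrative name):
newcols = merged.groupby('Date').agg(score=('score', 'sum'), compound=('compound', 'mean'))
This yields one dataframe with both columns, so the concat below could attach it in one piece instead of two.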
Let's concat them, and add back the Volume column which we left behind:
final = pd.concat([df1.set_index('Date')[['Volume']],newscoredf,newcompounddf],axis=1)
which gives you what you want; final looks like this:
+---------------------+----------+---------+------------+
| Date | Volume | score | compound |
|---------------------+----------+---------+------------|
| 2017-02-03 12:00:00 | 9228 | 171 | 0.584339 |
| 2017-02-03 13:00:00 | 9051 | 79 | 0.701325 |
| 2017-02-03 14:00:00 | 10518 | 149 | 0.41952 |
| 2017-02-03 15:00:00 | 9563 | 0 | 0 |
| 2017-02-03 16:00:00 | 9501 | 163 | 0.8693 |
| 2017-02-03 17:00:00 | nan | 174 | 0.449209 |
+---------------------+----------+---------+------------+
Checking: consider the rows whose events fall in the hour starting at 12:00. Score: 75 + 26 + 70 = 171. Compound: (0.71602 + 0.278191 + 0.758807) / 3 = 0.584339. Both agree with the first row of our result.
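To run the same recipe on the question's own frames, here is a minimal sketch; it assumes df is the events table with the date_by_hour column built above (datetimes floored to the hour), and df1 is Table 1, whose Date strings look like the sample ('2017-02-03 12-PM') and therefore parse with the format '%Y-%m-%d %I-%p':
import pandas as pd

# parse Table 1's 'Date' strings (e.g. '2017-02-03 12-PM') into datetimes
df1['Date'] = pd.to_datetime(df1['Date'], format='%Y-%m-%d %I-%p')

# aggregate events per hour: sum the scores, average the compound values
events = (df.groupby('date_by_hour')
            .agg(score=('score', 'sum'), compound=('compound', 'mean'))
            .reset_index()
            .rename(columns={'date_by_hour': 'Date'}))

# a left merge keeps exactly Table 1's timeline; hours without events get 0
final = df1.merge(events, on='Date', how='left').fillna({'score': 0, 'compound': 0})
Unlike the outer merge above, the left merge drops event hours that are missing from Table 1 (such as the 17:00 row), which matches the desired output in the question.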
Upvotes: 2