JMV12
JMV12

Reputation: 1045

Merge Two CSV Files With Cases Using Python

I need to merge two files of data together following some rules in the process. I'm slightly aware of how to use pandas to create and merge csv's, but I do not know how to do so following certain rules.

I have two files as exampled:

actual_data.csv

type    region_1    region_2    date         data
aw      west        2           10/01/2017   0.9318274971234
hss     east        3           05/12/2015   0.7136487234972
.       .           .           .            .
.       .           .           .            .
.       .           .           .            .

forecast_data.csv

type    region_1    region_2    date         data
jad     north       22          11/13/2025   0.71283741932
js      east        3           04/16/2023   0.16238471239
.       .           .           .            .
.       .           .           .            .
.       .           .           .            .

I need to merge these files following these rules:

  1. Merge actual_data.csv with forecast_data.csv
  2. If a date for actual_data.csv is missing, use forecast_data.csv for that date
  3. Always use actual_data.csv over forecast_data.csv when both are available for that specific month

Upvotes: 3

Views: 503

Answers (2)

it's-yer-boy-chet
it's-yer-boy-chet

Reputation: 2017

You need to do a left join of actual onto forecast (this based on your requirement that missing actual data should be filled, if you expect missing forecast data you would want to do an outer join).

import pandas as pd

act_df = pd.read_csv(actual_data.csv)
fore_df = pd.read_csv(forecast_data.csv)

res = fore_df.merge(act_df, on=['type', 'region_1', 'region_2'], how='left', suffixes=('_fore', '_act'))

At this point res will have an index of the merge columns ['type', region_1', region_2'], 2 date columns (date_fore, date_act), and 2 data columns (data_fore, date_act). From there you can fill the null values of the actual date with the forecast date and then collapse it to a single column if you like:

res['date'] = res['date_act'].fillna(res['date_fore'])
res.drop(['date_fore', 'date_act'], axis=1, inplace=True)

For the data it's a similar operation:

res['data'] = res['data_act'].fillna(res['data_fore'])
res.drop(['data_fore', 'data_act'], axis=1, inplace=True)

Upvotes: 3

Poojan
Poojan

Reputation: 3519

  • One solution is to use pandas combine_first.
  • here i assume both dataframe same shape.
  • Here i am giving first priority to all the columns.
  • You can loop over columns which you want to pick from first df.
  • Here is a small example of how you can achieve similar.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a':[1,2,np.nan], 'b':['x','y','z']})
df2 = pd.DataFrame({'a':[1,3,4], 'b':['a','b','c']})
df = pd.DataFrame()
for col in df1.columns:
    df[col] = df1[col].combine_first(df2[col])
df

output:

# df1
      a     b
0   1.0     x
1   2.0     y
2   NaN     z

#df2
    a   b
0   1   a
1   3   b
2   4   c

#df
    a       b
0   1.0     x
1   2.0     y
2   4.0     z

Upvotes: 1

Related Questions