Merge multiple data files (with multiple columns) that are parts of a time series

Question

I have a multiple time series data (txt files) that form one time series. Here is an oversimplified example:

    import pandas as pd
    import datetime as dt

    df1_1 = pd.DataFrame({'Date': [15/03/2019 10:00:11.000, 15/03/2019 10:00:12.000 , 15/03/2019 10:00:13.000],
   'Cond [mS/cm]': [7.45, 7.45, 7.45],
   'Temp [C]': [8.22, 8.22, 8.22],
    'Sal [PSU]': [7.63, 7.63, 7.63]})

    df1_2 = pd.DataFrame({'Date': [30/03/2019 10:00:11.000, 30/03/2019 10:00:12.000 , 30/03/2019 10:00:13.000],
   'Cond [mS/cm]': [7.45, 7.45, 7.45],
   'Temp [C]': [8.22, 8.22, 8.22],
    'Sal [PSU]': [7.63, 7.63, 7.63]})

    df2_1 = pd.DataFrame({'Date': [15/03/2019 10:00:11.000, 15/03/2019 10:00:12.000 , 15/03/2019 10:00:13.000],
   'Cond_2 [mS/cm]': [7.47, 7.47, 7.47],
   'Temp_2 [C]': [8.22, 8.22, 8.22],
    'Sal_2 [PSU]': [7.67, 7.67, 7.67]})  

     df2_2 = pd.DataFrame({'Date': [30/03/2019 10:00:11.000, 30/03/2019 10:00:12.000 , 30/03/2019 10:00:13.000],
   'Cond_2 [mS/cm]': [7.47, 7.47, 7.47],
   'Temp_2 [C]': [8.22, 8.22, 8.22],
    'Sal_2 [PSU]': [7.67, 7.67, 7.67]})

Where df1 means data from sensor 1, df1_1 and df_2 are next fragments of time series.

I want to combine all into one big DataFrame.

So far I've:

genereted file containing complete time series (data from sensors might have unexpected gaps),
read files as DataFrames,
defined headers,
set 'Date' column as an index of each DataFrame.

Then I wanted to join DataFrames to the complete time series by:

full_date = dates.join([df1_1, df1_2], how = "outer")

but I recived an error:

ValueError: Indexes have overlapping values: Index(['Cond [mS/cm]', 'Press [DBar]', 'Temp [C]', 'Sal [PSU]', 'Dens.anom [kg/m3]', 'SOS [m/s]'], dtype='object')

To check if the problem is really in the headers I merged files with slightly different headers (different headers = different sensor). Then I've recived:

    complete_df = ({' ':[01/04/2019 00:00:01, 01/04/2019 00:00:01 , 01/04/2019 00:00:01,
    'Cond [mS/cm]': [NaN, NaN, NaN],
    'Temp [C]': [NaN, NaN, NaN],
    'Sal [PSU]': [NaN, NaN, NaN],
    'Cond_2 [mS/cm]': [NaN, NaN, NaN],
    'Temp_2 [C]': [NaN, NaN, NaN],
    'Sal_2 [PSU]': [NaN, NaN, NaN]})

that is quite reasonable except that the data are not in the right order.

My question is: how should I combine all Dataframes to get one big DataFrame?

Gustavo Gradvohl · Accepted Answer

Pass your dates as strings, than convert them to DateTime. After that use pd.concat.There is still the issue that you are duplicating columns potentially unnecessarily (Cond[mS/cm], Cond_2[mS/cm]). Alternativelly you could have all columns named the same and pass one additional column with the sensor identifier (i.e "sensor":1)

import pandas as pd

df1_1 = pd.DataFrame({'Date': ['15/03/2019 10:00:11.000', '15/03/2019 10:00:12.000' , '15/03/2019 10:00:13.000'],
'Cond [mS/cm]': [7.45, 7.45, 7.45],
   'Temp [C]': [8.22, 8.22, 8.22],
    'Sal [PSU]': [7.63, 7.63, 7.63]})

df1_2 = pd.DataFrame({'Date': ['30/03/2019 10:00:11.000', '30/03/2019 10:00:12.000' , "30/03/2019 10:00:13.000"],
   'Cond [mS/cm]': [7.45, 7.45, 7.45],
   'Temp [C]': [8.22, 8.22, 8.22],
    'Sal [PSU]': [7.63, 7.63, 7.63]})

df2_1 = pd.DataFrame({'Date': ['15/03/2019 10:00:11.000', '15/03/2019 10:00:12.000' , '15/03/2019 10:00:13.000'],
   'Cond_2 [mS/cm]': [7.47, 7.47, 7.47],
   'Temp_2 [C]': [8.22, 8.22, 8.22],
    'Sal_2 [PSU]': [7.67, 7.67, 7.67]})  

df2_2 = pd.DataFrame({'Date': ['30/03/2019 10:00:11.000', '30/03/2019 10:00:12.000' , '30/03/2019 10:00:13.000'],
   'Cond_2 [mS/cm]': [7.47, 7.47, 7.47],
   'Temp_2 [C]': [8.22, 8.22, 8.22],
    'Sal_2 [PSU]': [7.67, 7.67, 7.67]}) 

complete_df  = pd.concat([df1_1,df1_2,df2_1,df2_2],ignore_index=True)
complete_df['Date'] = pd.to_datetime(complete_df['Date'])
complete_df.set_index('Date', inplace=True)
complete_df.fillna(0)

Merge multiple data files (with multiple columns) that are parts of a time series

Answers (1)

Related Questions