user11015000
user11015000

Reputation: 159

How to join time series data in python with differing headings?

I currently have about 500 csv files worth of time series data. I am joining them using this code:

from glob import glob

with open('master.csv', 'a') as singleFile:
    for csv in glob('*data.csv'):
        if csv == 'master.csv':
            pass
        else:
            for line in open(csv, 'r'):
                singleFile.write(line)

However there is an issue with this, the headings of each file is not the same, as the first row has a list of companies, now as time goes on some companies retire and some companies join the market, thus there are inconsistencies as I scroll down my merged file.

To illustrate my problem I'll do an example:

This is currently the format of the first three of my csv files.

current

My desired output looks like this:

desired

Currently I am arriving at my desired output by manually moving around the columns, is there any way when merging my files python can detect a new column and move it to the right?

Many thanks!

Upvotes: 2

Views: 377

Answers (1)

Darius
Darius

Reputation: 12042

Data

1.csv:

a,b
1,2
1,2

2.csv:

b,c
2,3
2,3

3.csv:

c,d
3,4
3,4

Pandas

You can use pandas to load and convert the CSV files to dataframes. After that you can concatenate them to a single dataframe which can be saved back to a CSV file.

enter image description here

A full example:

from glob import glob
import pandas as pd

df = pd.concat([pd.read_csv(f, delimiter=',') for f in glob('*.csv') if not f.startswith('all.csv')], ignore_index=True, sort=False)

print(df)
#      a    b    c    d
# 0  1.0  2.0  NaN  NaN
# 1  1.0  2.0  NaN  NaN
# 2  NaN  NaN  3.0  4.0
# 3  NaN  NaN  3.0  4.0
# 4  NaN  2.0  3.0  NaN
# 5  NaN  2.0  3.0  NaN

df.to_csv('all.csv', index=False)
# a,b,c,d
# 1.0,2.0,,
# 1.0,2.0,,
# ,,3.0,4.0
# ,,3.0,4.0
# ,2.0,3.0,
# ,2.0,3.0,

Image source: https://pandas.pydata.org/pandas-docs/version/0.20/merging.html

Upvotes: 1

Related Questions