diego
diego

Reputation: 23

How to merge many CSV files?

I have about 7500 csv that needs to be merged into a single one in order to create an easy-readable table. The files format are as follow:

  1. each file is denominated as a ticker of a stock (example: AA.csv, AAL.csv, AAPL.csv,etc...)

  2. every file itself contains a date and a number in this format

    2018-10-11,1
    2018-10-12,3
    2018-10-15,2
    ...
    

Now I want to merge them into a single csv file where the resulting table has in the header the name of the tickers, in the first column the dates and following the numbers (obviously keeping the csv format).

Example:

screenshot of example

Note that some csv files are empty and some have differents starting dates or dates are just discontinued

Upvotes: 1

Views: 147

Answers (2)

user5658625
user5658625

Reputation:

Since you have non-uniform csvs, and some specific requirements:

  • certain format to output the csv
  • some csv files are empty
  • some csv files have different start dates
  • some csv files have discontinues dates

I'd do something like this with the csv module:

import os,csv
csv_dir = "C:\\Users\Diego\\PycharmProjects\\thesis\\Tickers List\\"
csv_out = 'csvout.csv'

all_dates = []
fin_table = dict()

# iterating over all csv files
for csv_file in os.listdir(csv_dir):
    if True:

        csv_name = csv_file #[:-4] # excluding the ".csv" extension 
        table = dict()
        
        print(csv_dir+csv_file)
        with open(csv_dir+csv_file, 'r') as r:
            csvReader = csv.reader(r)

            for i, row in enumerate(csvReader):
                all_dates.append(row[0])
                table[row[0]] = row[1]
        fin_table[csv_name] = table
    

# populating the final csv 
with open(csv_out, 'w+') as w:
    csvWriter = csv.writer(w)

    # adding the header
    input_row = ['Dates']
    for stock, table in iter(sorted(fin_table.items())):
        input_row.append(stock)
    csvWriter.writerow(input_row)

    # Adding the rest of the table by available date 
    # since, some dates may not have a value, as you said   
    for date in sorted(all_dates): 
        
        # adding the date rows and stock values 1 by 1
        input_row = [date]
        for stock, table in iter(sorted(fin_table.items())):
            # NOVALUE will be put in cells  
            # where a value doesn't exist for a date
            input_row.append(table.get(date, 'NOVALUE')) 
        csvWriter.writerow(input_row)

# ************************** end of program code

Upvotes: 0

Ángel Igualada
Ángel Igualada

Reputation: 891

You could do something like this:

import pandas as pd
import numpy as np
from glob import glob

dfs_list = []
for csv_file in glob('Tickers List/*.csv'):
    stock_ticker = csv_file.split(".")[0]
    df = pd.read_csv(csv_file,header=None, names=["date","num"])
    if df.shape[0] >0:
        df["date"] = pd.to_datetime(df["date"],format="%Y-%m-%d")
        df["stock_ticker"] = stock_ticker
        dfs_list.append(df)

final_df = pd.concat(dfs_list)

With glob('dir/*.csv') we get all the csv files on a folder.

After this, you will have a DataFrame that looks like this: enter image description here

If you want to change to your format, you can do this: (note that the dates are automatically sorted because is used as index).

final_df = pd.pivot_table(final_df, values='num', index=['date'],
               columns=['stock_ticker'], fill_value=np.nan)

And you will have a DataFrame that looks like this:

enter image description here

Now you can write this DataFrame to a new csv with:

final_df.to_csv("merged.csv")

FULL CODE

import pandas as pd
import numpy as np
from glob import glob

dfs_list = []
for csv_file in glob('Tickers List/*.csv'):
    stock_ticker = csv_file.split(".")[0]
    df = pd.read_csv(csv_file,header=None, names=["date","num"])
    if df.shape[0] >0:
        df["date"] = pd.to_datetime(df["date"],format="%Y-%m-%d")
        df["stock_ticker"] = stock_ticker
        dfs_list.append(df)

final_df = pd.concat(dfs_list)

final_df = pd.pivot_table(final_df, values='num', index=['date'],
               columns=['stock_ticker'], fill_value=np.nan)

final_df.to_csv("merged.csv")

Upvotes: 1

Related Questions