Mike Mathews Jr
Mike Mathews Jr

Reputation: 47

Looping through csv files to create concatenated dataframe. Need to use date in file name to create new column for each file

I have some code that looks like this:

import pandas as pd
import glob

root_directory = r"\\some\shared\directory"

all_files = glob.glob(f'{root_directory}\CPE_*.csv')

li = []

for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, encoding='latin1')
    li.append(frame)

df = pd.concat(li, axis=0, ignore_index=True)

This code allows me to concatenate the data and create a master csv file, but I want to add a new column to each dataframe as I loop through them. The file names look something like this: CPE_02082020.csv , CPE_02092020 , etc. So the date is in the file name itself with the format of mmddyyyy. How do I add a date column to each file as I loop through and concatenate them?

Upvotes: 1

Views: 1224

Answers (2)

Umar.H
Umar.H

Reputation: 23099

You can assign the column during a for loop, let's simplify a little using a list comprehension and Pathlib:

import pandas as pd 
from pathlib import Path 

root_dir = Path('\\some\shared\directory')

all_files = root_dir.glob('*CPE_*.csv')

dfs = [
    pd.read_csv(f, index_col=None, header=0, encoding='latin1')\
            .assign(date_col=
                pd.to_datetime(f.stem.split('_')[1],format='%m%d%Y',errors='coerce'))
    for f in all_files
]


print(pd.concat(dfs))

enter image description here

Upvotes: 0

Blake
Blake

Reputation: 105

To add a date column to each dataframe you can simply add a new column to frame while looping through all_files. This method was suggested by biobirdman on a previous post. You can then use split() to get just the date from the filename.

Try:

for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, encoding='latin1')
    frame['date'] = filename.split('_')[1]
    li.append(frame)

df = pd.concat(li, axis=0, ignore_index=True)

Then to convert df['date'] to datetime add:

df['date'] = pd.to_datetime(df['date'])

Upvotes: 3

Related Questions