Adding dataframe column names based on filename after merging using Glob

Question

I have a folder of Excel files, one per country, covering all countries in the world. They all share the same format, with the data in the sheet 'Dataset2' of each file.

I have merged all files together into one using glob, but I need to know which file (i.e. which country) each column comes from.

Is there a way to do this?

import glob
import os
import pandas as pd

os.chdir("Countries/")
extension = 'xlsx'

all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

combined = pd.concat([pd.read_excel(f, sheet_name='Dataset2') for f in all_filenames ],axis=1, ignore_index=True)

combined.to_excel( "New/combined.xlsx", index=False, encoding='utf-8-sig')

Dan · Accepted Answer

You could unpack the list comprehension into a for-loop and add an additional column to each data file, something like this:

import glob
import os
import pandas as pd

os.chdir("Countries/")
extension = 'xlsx'

# glob.glob already returns a list, so no list comprehension is needed
all_filenames = glob.glob('*.{}'.format(extension))

file_list = []
for f in all_filenames:
    data = pd.read_excel(f, sheet_name='Dataset2')
    data['source_file'] = f  # create a column with the name of the file
    file_list.append(data)

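# Note: with axis=1, ignore_index=True relabels the columns 0..n-1, so the
# 'source_file' header is dropped; only the file-name values remain.
combined = pd.concat(file_list, axis=1, ignore_index=True)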

# to_excel no longer accepts encoding= (removed in pandas 2.0);
# the "New" folder must already exist
combined.to_excel("New/combined.xlsx", index=False)
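
Since the frames are joined side by side (axis=1), another option is to let pd.concat label the columns itself through its keys argument, which adds an outer column level per file. The following is only a sketch under some assumptions: the helper name combine_with_source is made up for illustration, the two small in-memory frames stand in for pd.read_excel(f, sheet_name='Dataset2'), and the file names are assumed to be country names.

```python
import os
import pandas as pd


def combine_with_source(frames_by_file):
    """Concatenate frames side by side, labelling each column with its source.

    frames_by_file maps a file name (e.g. 'France.xlsx') to its DataFrame.
    Hypothetical helper, not part of pandas.
    """
    # Strip the extension so the outer label is just the country name
    # (assumes files are named after countries).
    keys = [os.path.splitext(os.path.basename(name))[0] for name in frames_by_file]
    # keys= adds an outer column level; ignore_index is left at its default
    # (False), otherwise the labels would be replaced by 0..n-1.
    return pd.concat(list(frames_by_file.values()), axis=1, keys=keys)


# In-memory stand-ins for the real Excel sheets (illustrative data only)
frames = {
    "France.xlsx": pd.DataFrame({"year": [2000, 2001], "value": [1.0, 2.0]}),
    "Spain.xlsx": pd.DataFrame({"year": [2000, 2001], "value": [3.0, 4.0]}),
}
combined = combine_with_source(frames)

# Optionally flatten the two column levels into single names like 'France_year'
flat = combined.copy()
flat.columns = ["{}_{}".format(country, col) for country, col in combined.columns]
print(flat.columns.tolist())
```

With real files, the dictionary could be built as {f: pd.read_excel(f, sheet_name='Dataset2') for f in glob.glob('*.xlsx')}. The flattening step is there because some pandas versions refuse to write two-level column headers with index=False, so single-level names may be the safer thing to pass to to_excel.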
