rrai
rrai

Reputation: 43

How to map filenames to the columns extracted from multiple excel files using pandas

I'm trying to extract all columns from multiple excel files and then map the filename to each extracted column however I'm struggling to work around a 'TypeError: Index does not support mutable operations'.

Below are my two files:

Fund_Data.xlsx:
    
    FUND ID  FUND NAME           AMOUNT   client code  Price description   Trade Date    Trade Datetime  
0   10101   Holdings company A  10000.5   1234        124.3   abcd        2020-08-19    2020-08-19 12:30:00 
1   20202   Holdings company B  -2000.5   192         -24.2   abcd        2020-08-20    2020-08-20 12:30:00 
2   30303   Holdings company C  3000.5    123         192     NaN         2020-08-21    2020-08-21 12:30:00 
3   10101   Holdings company A  10000     1234567     5.5     NaN         2020-08-22    2020-08-22 12:30:00
4   20202   Holdings company B  10000.5   9999        3.887   abcd        2020-08-23    2020-08-23 12:30:00
Stocks.xlsx

ID   STOCK                     VALUE
1   3i                         100
2   Admiral Group              200
3   Anglo American             300
4   Antofagasta                100
5   Ashtead                    200
6   Associated British Foods   300
7   AstraZeneca                400
8   Auto Trader Group          500
9   Avast                      600

And here is my code so far:

import pandas as pd
from os import walk

f = []
directory = 'C:/Users/rrai020/Documents/Python Scripts/DD'
for (dirpath, dirnames, filenames) in os.walk(directory):
    for x in filenames:
        if x.endswith('xlsx'):
            f.append(x)
#f = ['Fund_Data.xlsx', 'Stocks.xlsx'] created a list from filenames in directory ^^^

data = pd.DataFrame() # initialize empty df
for filename in f:
    df = pd.read_excel(filename, dtype = object, ignore_index=True).columns # read in each excel to df
    df['filename'] = filename # add a column with the filename
    data = data.append(df) # add all small df's to big df 

print(data)

I'm trying to achieve the following output (or similar):

Field Name       Filename
FUND ID          Fund_Data.xlsx
FUND NAME        Fund_Data.xlsx
AMOUNT           Fund_Data.xlsx
client code      Fund_Data.xlsx
Price            Fund_Data.xlsx
description      Fund_Data.xlsx
Trade Date       Fund_Data.xlsx
Trade Datetime   Fund_Data.xlsx
Trade time       Fund_Data.xlsx
ID               Stocks.xlsx
STOCK            Stocks.xlsx
VALUE            Stocks.xlsx

I would like the code to be flexible so that it can work for more than the 2 files I have here. Apologies if this is trivial, I'm still learning!

Upvotes: 1

Views: 655

Answers (1)

perl
perl

Reputation: 9941

The problem is with the dataframe that you're appending. We need to create a dataframe with Field Name, Filename columns for each file inside the loop, and then append it to data.

Here's an option:

data = pd.DataFrame()
for filename in f:
    # read in each excel to df
    df = pd.read_excel(filename, dtype = object, ignore_index=True).columns

    # create a dataframe with (Field Name, Filename) columns for current file
    x = pd.DataFrame({'Field Name': x.columns, 'Filename': filename})
    
    # append to the global dataframe
    data = data.append(x)

data

Output:

          Field Name        Filename
0            FUND ID  Fund_Data.xlsx
1          FUND NAME  Fund_Data.xlsx
2             AMOUNT  Fund_Data.xlsx
3        client code  Fund_Data.xlsx
4  Price description  Fund_Data.xlsx
5         Trade Date  Fund_Data.xlsx
6     Trade Datetime  Fund_Data.xlsx
7                 ID     Stocks.xlsx
8              STOCK     Stocks.xlsx
9              VALUE     Stocks.xlsx

Upvotes: 1

Related Questions