Gustavo Rottgering
Gustavo Rottgering

Reputation: 571

Read_Excel with list of list for usecols

I have to read several Excel files(from a url link) that contain a fixated 5 columns. However the name of the columns may slightly diverge i.e('foo','fo','f(0)') because people.

There is a way that I can pass a list of lists containing, like [['foo','fo','f(0)'],['foo2','f02','f(o)2'],...] for usecols?

Now I use this code:

links = df['column_I_need'].str.join(sep='')
col_names = ['foo','fo','f(0)']
for i in links:
    try:
        name = i[50:]
        df = pd.read_excel(i, header = 1, names = col_names, encoding = 'utf-8') #usecols = names)
        file_name = r"%s\%s" %(pasta_sol,name)

        writer = pd.ExcelWriter(file_name , engine='xlsxwriter')
        df.to_excel(writer, header = True, index = True)
        writer.close()

    except (TypeError, IndexError, ValueError, XLRDError, BadZipFile, urllib.error.URLError) as e:
        erros.append((i, e.args[0]))

The information regarding each column in each file is for a specific field in the system.

I really couldn't find anything. In most of the files the value in the cells are correct, but people change the column name.

If someone have any idea I would be grateful.

Thanks

Upvotes: 1

Views: 1271

Answers (1)

Umar.H
Umar.H

Reputation: 23099

This is the rough version of the function I used in my last role (I wasn't up to scratch with Git and didn't version control/save all my stuff)

What this will do is iterate over your chosen directory and return a list of matching excel files with their path and the column.

When the returned dictionary, you can loop over the filepaths and use the value as the usecols argument.

for path,column in return_value.items():
    df = pd.read_excel(path,usecols=column)

In Action.

return_value = find_common_column(r"C:\Users\datanovice\Documents\Python Scripts\Test"
,sheetname='Sheet1'
,col_list=['dat','test'])

print(return_value)

{WindowsPath('C:/Users/datanovice/Documents/Python Scripts/Test/doc_1.xlsx'): Index(['data'], dtype='object')}

Modules

import pandas as pd 
import numpy 
from pathlib import Path
from xlrd import XLRDError

Function

def find_common_column(path,sheetname,col_list=list):
    """
    Takes in three arguments and returns a 
    dictionary of paths and common columns

    Path : Path to your excel files.  
    sheetname : the sheet we will use. 
    collist : columns you need to parse from each sheet.
    """
    excel_dict = {f : pd.ExcelFile(f) for f in Path(path).glob('*.xlsx')}

    pat = '|'.join(col_list)

    dfs = {}
    for filename,each_excel in excel_dict.items():
        try:
            df = pd.read_excel(each_excel,sheet_name=sheetname,nrows=1)
            cols = df.filter(regex=pat,axis=1).columns
            dfs[filename] = cols

        except XLRDError as err:
            pass
    return dfs

Upvotes: 1

Related Questions