Reputation: 571
I have to read several Excel files(from a url link) that contain a fixated 5 columns. However the name of the columns may slightly diverge i.e('foo','fo','f(0)') because people.
There is a way that I can pass a list of lists containing, like [['foo','fo','f(0)'],['foo2','f02','f(o)2'],...]
for usecols?
Now I use this code:
links = df['column_I_need'].str.join(sep='')
col_names = ['foo','fo','f(0)']
for i in links:
try:
name = i[50:]
df = pd.read_excel(i, header = 1, names = col_names, encoding = 'utf-8') #usecols = names)
file_name = r"%s\%s" %(pasta_sol,name)
writer = pd.ExcelWriter(file_name , engine='xlsxwriter')
df.to_excel(writer, header = True, index = True)
writer.close()
except (TypeError, IndexError, ValueError, XLRDError, BadZipFile, urllib.error.URLError) as e:
erros.append((i, e.args[0]))
The information regarding each column in each file is for a specific field in the system.
I really couldn't find anything. In most of the files the value in the cells are correct, but people change the column name.
If someone have any idea I would be grateful.
Thanks
Upvotes: 1
Views: 1271
Reputation: 23099
This is the rough version of the function I used in my last role (I wasn't up to scratch with Git and didn't version control/save all my stuff)
What this will do is iterate over your chosen directory and return a list of matching excel files with their path and the column.
When the returned dictionary, you can loop over the filepaths and use the value as the usecols argument.
for path,column in return_value.items():
df = pd.read_excel(path,usecols=column)
return_value = find_common_column(r"C:\Users\datanovice\Documents\Python Scripts\Test"
,sheetname='Sheet1'
,col_list=['dat','test'])
print(return_value)
{WindowsPath('C:/Users/datanovice/Documents/Python Scripts/Test/doc_1.xlsx'): Index(['data'], dtype='object')}
import pandas as pd
import numpy
from pathlib import Path
from xlrd import XLRDError
def find_common_column(path,sheetname,col_list=list):
"""
Takes in three arguments and returns a
dictionary of paths and common columns
Path : Path to your excel files.
sheetname : the sheet we will use.
collist : columns you need to parse from each sheet.
"""
excel_dict = {f : pd.ExcelFile(f) for f in Path(path).glob('*.xlsx')}
pat = '|'.join(col_list)
dfs = {}
for filename,each_excel in excel_dict.items():
try:
df = pd.read_excel(each_excel,sheet_name=sheetname,nrows=1)
cols = df.filter(regex=pat,axis=1).columns
dfs[filename] = cols
except XLRDError as err:
pass
return dfs
Upvotes: 1