Reputation: 201
So basically I have a ton of files that change each week that passes by, I want to know if there is a way that I can go ahead and specify to the python script to grab that sheet that contains an specific column name, so for example
test.xlsx
I have the following structuresheet 1
columnA | columnB | columnC | Columnd | ColumnE | ColumnF |
---|---|---|---|---|---|
dsf | sdfas | asdf | sadf | asfdsd | sdfasf |
sdfsd | sadfsd | asdfsd | asdfasd | asdfdsf | sdfadf |
Sheet 2
jira_tt | alignID | issueID | testID |
---|---|---|---|
dsf | sdfas | asdf | sadf |
As you can see the excel file has 2 sheets, however this is just an example as some file may have more than 2 sheets or the names of the sheets will change, as stated above I want to read all the sheets in all the files that have the keyword "jira" on their columns, so far I have been able to create a script that reads all the files on the target folder, however I don't have a clue on how to specify the sheet as I needed, here is part of the code that I've created so far
import glob as glob
import pandas as pd
#using glob to get all the files that contains an xlsx extension
ingestor = glob.glob("*.xlsx")
for f in ingestor:
df = pd.read_excel(f)
df
Any kind of assistance or guidance will be appreciated it.
Upvotes: 0
Views: 914
Reputation: 1
First, note that pd.read_excel(f)
returns the first sheet by default. If you want to get another sheet or more than one, you should use the sheet_name
argument of pandas.read_excel
, read here.
For the case that the number of sheets is unknown, specify None
to get all worksheets:
pd.read_excel(f, sheet_name=None)
Note that now a dict of DataFrames is returned.
To get the sheets with a column that contain "jira", simply check the column's name:
for f in files:
df_dict = pd.read_excel(f, sheet_name=None)
for sheet_name, df in df_dict.items():
for column in df.columns:
if 'jira' in column:
# do something with this column or df
print(df[column])
Upvotes: 0
Reputation: 1602
To include all your files as DataFrame's
you can create a list to store it and use merge()
method to split files in one DataFrame
, for example:
import glob as glob
import pandas as pd
ingestor = glob.glob("*.xlsx")
df = reduce(lambda left,right: pd.merge(left,right), [pd.read_excel(data) for data in ingestor])
print(df.isin(['jira']))
if want just files that contains a specific word (like "jira"
), you need to evaluate with a conditional using any()
method on each iteration and merge data:
ingestor = glob.glob("*.xlsx")
df = pd.DataFrame()
for f in ingestor:
data = pd.read_excel(f)
df.insert(data[data.str.contains("jira"))
print(df)
Upvotes: 1