Savina Dimitrova
Savina Dimitrova

Reputation: 155

Python Pandas - extract excel data by column index and column header at the same time

I'm trying to write a Pandas script which to extract data from several Excel files. They contain between 10 and 15 columns. From these columns I need the 1st one which has different header in every file, and some other columns which always have the same header names ('TOTAL', 'CLEAR', 'NON-CLEAR'and 'SYSTEM') but they are positioned under different column index in the different files. (I mean that in one of the files 'TOTAL' is the 3rd column in the table but in another file it is the 5th column)

I know that using usecols keyword I could specify which columns to use, but it looks like this argument takes only header names or only column indices, and never both of them in a combination.

Is it possible to write a statement which to take at the same time the 1st column by its index and then the other ones by header name?

The below statement doesn't work:

df = pd.read_excel(file, usecols = [0,'TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM'])

Upvotes: 1

Views: 1214

Answers (3)

Balaji Ambresh
Balaji Ambresh

Reputation: 5012

Here you go:

def callable():
    first_column = None
    def process(column_name):
        nonlocal first_column
        if first_column is None:
            first_column = column_name
            return True
        if first_column == column_name:
            return True
        return column_name in ['TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM']
    return process
print(pd.read_csv(file, usecols=callable()))

Upvotes: 0

Oliver Richardson
Oliver Richardson

Reputation: 281

If it has only fifteen columns, it is probably faster not to read the file twice. You can read the whole file into memory and then extract the columns you need with the much nicer pandas interface:

df = pd.read_excel(file)
df = df[ [ df.columns[0], 'TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM'] ]

Upvotes: 0

TiTo
TiTo

Reputation: 865

you could use pd.read_excel()twice and than join both dfs

df1 = pd.read_excel(file, usecols = [0])
df2 = pd.read_excel(file, usecols = ['TOTAL', 'CLEAR', 'NON-CLEAR','SYSTEM'])
df = pd.concat([df1, df2], axis = 1, join = 'outer')

Upvotes: 1

Related Questions