AJCaffarini
AJCaffarini

Reputation: 87

How do I read from an Excel spreadsheet only rows meeting a certain condition into Python?

I need to create a pandas dataframe in Python by reading in an Excel spreadsheet that contains almost 50,000 rows and 81 columns. The file contains information about medical professionals of all kinds: physicians, nurses, nurse practitioners, etc. I want to read in only rows where a column 'PROFTYPE' has value of 'NURSEPRACT'.

I'm using Python 3.73, and I've read in the entire file, and then I trim it down by the column PROFTYPE afterward; but the reading in takes too long. I'd like to read in only those rows where PROFTYPE == 'NURSEPRACT'.

df_np = pd.read_excel(SourceFile, sheetname='Data', header=0)
df_np = df_np[df_np['PROFTYPE'] == 'NURSEPRACT']

This code actually works, but that's because I'm reading in the entire file first. I'm actually interested in reading in only those that meet the condition of PROFTYPE = 'NURSEPRACT'.

Upvotes: 2

Views: 3054

Answers (1)

kevins_1
kevins_1

Reputation: 1306

One idea is that you can

  1. load only the 'PROFTYPE' column,
  2. identify the non-nurse practitioner rows,
  3. load the entire table to keep only the nurse practitioner rows.

Here that strategy is in action:

df = pd.read_excel(SourceFile,
                   sheet_name='Data',
                   header=0,
                   usecols=['PROFTYPE']) # <-- Load just 'PROFTYPE' of the following table
# ID    PROFTYPE    YEARS_IN_PRACTICE  
# 1234  NURSEPRACT  12
# 43    NURSE   32
# 789   NURSEPRACT  4
# 34    PHYSICIAN   2
# 93    NURSEPRACT  13
row_numbers = [x+1 for x in df[df['PROFTYPE'] != 'NURSEPRACT'].index]
df = pd.read_excel(SourceFile, sheet_name='Data', header=0, skiprows=row_numbers)
# ID    PROFTYPE    YEARS_IN_PRACTICE
# 1234  NURSEPRACT  12
# 789   NURSEPRACT  4
# 93    NURSEPRACT  13

Upvotes: 1

Related Questions