Reputation: 87
I need to create a pandas dataframe in Python by reading in an Excel spreadsheet that contains almost 50,000 rows and 81 columns. The file contains information about medical professionals of all kinds: physicians, nurses, nurse practitioners, etc. I want to read in only rows where a column 'PROFTYPE' has value of 'NURSEPRACT'.
I'm using Python 3.73, and I've read in the entire file, and then I trim it down by the column PROFTYPE afterward; but the reading in takes too long. I'd like to read in only those rows where PROFTYPE == 'NURSEPRACT'.
df_np = pd.read_excel(SourceFile, sheetname='Data', header=0)
df_np = df_np[df_np['PROFTYPE'] == 'NURSEPRACT']
This code actually works, but that's because I'm reading in the entire file first. I'm actually interested in reading in only those that meet the condition of PROFTYPE = 'NURSEPRACT'.
Upvotes: 2
Views: 3054
Reputation: 1306
One idea is that you can
Here that strategy is in action:
df = pd.read_excel(SourceFile,
sheet_name='Data',
header=0,
usecols=['PROFTYPE']) # <-- Load just 'PROFTYPE' of the following table
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 43 NURSE 32
# 789 NURSEPRACT 4
# 34 PHYSICIAN 2
# 93 NURSEPRACT 13
row_numbers = [x+1 for x in df[df['PROFTYPE'] != 'NURSEPRACT'].index]
df = pd.read_excel(SourceFile, sheet_name='Data', header=0, skiprows=row_numbers)
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 789 NURSEPRACT 4
# 93 NURSEPRACT 13
Upvotes: 1