user3341332
user3341332

Reputation: 439

How can I get specific rows from an excel file

I am using pandas to read in an Excel file that has the following structure:

            A                 B         C       D        E
  ╔════════════════════╦════════════╦═══════╦═══════╦══════════╗
1 ║ Project Name       ║            ║       ║       ║          ║
  ╠════════════════════╬════════════╬═══════╬═══════╬══════════╣
2 ║ Date of generation ║            ║       ║       ║          ║
  ╠════════════════════╬════════════╬═══════╬═══════╬══════════╣
3 ║ ID                 ║ Type       ║ Value ║ Color ║ Capacity ║
  ╠════════════════════╬════════════╬═══════╬═══════╬══════════╣
4 ║ 1                  ║ Car        ║ 1,000 ║ Red   ║ 2        ║
  ╠════════════════════╬════════════╬═══════╬═══════╬══════════╣
5 ║ 2                  ║ Truck      ║ 800   ║ Green ║ 12       ║
  ╠════════════════════╬════════════╬═══════╬═══════╬══════════╣
6 ║ 3                  ║ Helicopter ║ 5,000 ║ White ║ 4        ║
  ╚════════════════════╩════════════╩═══════╩═══════╩══════════╝

As you can see, the data headings are on row 3, and the information continues underneath. If I read the frame in like so:

df = pandas.read_excel("sample.xls", sheetname=0, skiprows=2, header=0)

my dataframe has the correct portion of data, but I'm unable to read the information in the first cells, A1 and A2.

I've tried the following as well, to 'slice' a new dataframe once I have the top two values taken. I get an error:

df = pandas.read_excel("sample.xls", sheetname=0, skiprows=0, header=None)

project_name = df.iloc[0][0]
project_date = df.iloc[1][0]

new_header = df.iloc[2]         #grab the first row for the header
df = df[3:]                     #take the data less the header row
df.rename(columns = new_header) #set the header row as the df header

#tidyup   
df = df.fillna('')              #convert NaN (blank) cells to empty string
df.applymap(str)                #convert all columns to str
df.columns = df.columns.str.strip()   #strip whitespace before and after headers
df_obj = df.select_dtypes(['object']) #select all object columns (not int/float columns)
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip()) #strip (trim whitepace) of all object columns

Running the above I get an error:

AttributeError: Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed')

Its not clear to me why as the same 'tidyups section but reading the df as in my first code line above work.

Is there a way I can get to the 'above header' information without creating a whole new dataframe by reading the file twice?

Upvotes: 2

Views: 187

Answers (1)

jezrael
jezrael

Reputation: 862396

There is problem you forget to assign back:

df = df.applymap(str)   

Also better (faster) is instead applymap use astype:

df = df.astype(str)   

All together:

project_name = df.iloc[0, 0]
project_date = df.iloc[1, 0]

new_header = df.iloc[2]        
df = df[3:]      
df.columns = new_header.str.strip()
df.columns.name = None

df = df.astype(str).reset_index(drop=True)
df = df.apply(lambda x: x.str.strip())
print (df)
  ID        Type  Value  Color Capacity
0  1         Car  1,000    Red        2
1  2       Truck    800  Green       12
2  3  Helicopter  5,000  White        4

Upvotes: 3

Related Questions