Reputation: 276
I am new to pandas - I am trying to extract some useful information from a (very graphically designed) per week Excel file that represents the booking of certain rooms, 52 files for one year. It contains the name of the employee, the working hours and the project. i am mainly interested in
From all i read about Panda, this is pretty difficult case, since the representation of the data is not well suited for extraction.
Basically the only relevant info in the first column is the ROOM statement, which is then followed by descriptive 4 rows of text that i dont need in this context. Starting with the ROOM string, on each date column, there are 4 lines of relevant info which i need to extract. For my usecase, i dont need to know who worked in which room, but the ROOMS are what are used as the index.
Right now i am stuck in how to reformat the first column in a way that Panda can make use of it in a meaningful way. My idea is, that i i search for any line that contains ROOM, create a new index Column from the resulting boolean statement, use these as a multiindex, etc. but i am stuck at the very beginning.
Before i continue in that direction, i would like to understand if the general approach on how to make a document like this more usable for Pandas, i wanted to check if there are some best practices on how to deal with it and if my idea of creating a multindex out of the 4 row room description is the right approach... Like i said i am very new to pandas, so please forgive if there are general misunderstandings on how to work with this - or IF what i want to do is doable in Pandas at all...
import pandas as pd
from pathlib import Path
# Assign spreadsheet filename to `file`
file = './data/Week02-2019.xlsx'
# Load spreadsheet
xl = pd.ExcelFile(file)
# Remove all Sorts of crap on loading
df1 = xl.parse(sheet_name='Booking', header=1, parse_dates=[1], na_values=['xxx',''], skiprows=[2,3,4,5,6,7,8,9,10,19,28,37,46,64], usecols="A:H")
print (df1.iloc[:,0].str.contains("ROOM"),"NewIndex")
+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
| 2018 | 01.01.2018 | 02.08.2018 | 03.08.2018 | 04.08.2018 | 05.08.2018 | 06.08.2018 | 07.08.2018 |
+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
| ROOM 01 (Morning) | John Doe | Jane Doe | Donny Doe | | | | |
| Very Nice | Project# | Project# | Project# | | | | |
| Good Projector | Project Title | Project Title | Project Title | | | | |
| Telephone 1234 | 9:30-17.00 | 8-13.00 | 12-14.00 | | | | |
| --------- | ---- | ---- | ---- | --- | ---- | --- | --- |
| ROOM 01 (Afternoon) | Alan Smithee | Susi Smithee | Donald Smithee | | | | |
| Very Nice | Project# | Project# | Project# | | | | |
| Good Projector | Project Title | Project Title | Project Title | | | | |
| Telephone 1234 | 17:30-21.00 | 13.15-16.00 | 14.15-16.00 | | | | |
| ----- | ---------- | ---------- | --------- | ---- | --- | --- | |
| ROOM 02 (Morning) | Jimmy Doe | Duffy Duck | Benny Blanco | | | | |
| Not So Nice | Project# | Project# | Project# | | | | |
| Whiteboard | Project Title | Project Title | Project Title | | | | |
| Telephone 5678 | 9:30-17.00 | 8-13.00 | 12-14.00 | | | | |
| --------- | ---- | ---- | ---- | --- | ---- | --- | --- |
| ROOM 02 (Afternoon) | Doris Day | Teddy Kaczinsky| Ru Paul | | | |
| Not so Nice | Project# | Project# | Project# | | | | |
| Whiteboard | Project Title | Project Title | Project Title | | | | |
| Telephone 5678 | 17:30-21.00 | 13.15-16.00 | 14.15-16.00 | | | | |
+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
Upvotes: 0
Views: 124
Reputation: 863256
Use DataFrame.set_index
with DataFrame.stack
and Series.unstack
for reshape, output get MultiIndex
:
first = df.columns[0]
#repeat only ROOM data in first column
df[first] = df[first].where(df[first].str.contains("ROOM")).ffill()
#create helper columns
df['group'] = df.index % 4
#new columns names
d = {0:'name', 1:'project', 2:'project title', 3: 'time'}
df1 = (df.set_index([first, 'group'])
.rename(columns = lambda x: pd.to_datetime(x, format='%d.%m.%Y'))
.stack()
.unstack(1)
.rename(columns=d)
.swaplevel(1,0)
.sort_index()
.rename_axis(index=['date', 'room'], columns=None)
)
print (df1)
name project project title \
date room
2018-01-01 ROOM 01 (Afternoon) Alan Smithee Project# Project Title
ROOM 01 (Morning) John Doe Project# Project Title
ROOM 02 (Afternoon) Doris Day Project# Project Title
ROOM 02 (Morning) Jimmy Doe Project# Project Title
2018-08-02 ROOM 01 (Afternoon) Susi Smithee Project# Project Title
ROOM 01 (Morning) Jane Doe Project# Project Title
ROOM 02 (Afternoon) Teddy Kaczinsky Project# Project Title
ROOM 02 (Morning) Duffy Duck Project# Project Title
2018-08-03 ROOM 01 (Afternoon) Donald Smithee Project# Project Title
ROOM 01 (Morning) Donny Doe Project# Project Title
ROOM 02 (Afternoon) Ru Paul Project# Project Title
ROOM 02 (Morning) Benny Blanco Project# Project Title
time
date room
2018-01-01 ROOM 01 (Afternoon) 17:30-21.00
ROOM 01 (Morning) 9:30-17.00
ROOM 02 (Afternoon) 17:30-21.00
ROOM 02 (Morning) 9:30-17.00
2018-08-02 ROOM 01 (Afternoon) 13.15-16.00
ROOM 01 (Morning) 8-13.00
ROOM 02 (Afternoon) 13.15-16.00
ROOM 02 (Morning) 8-13.00
2018-08-03 ROOM 01 (Afternoon) 14.15-16.00
ROOM 01 (Morning) 12-14.00
ROOM 02 (Afternoon) 14.15-16.00
ROOM 02 (Morning) 12-14.00
EDIT: Error means there are some ROOM
values duplicated, so need handle it by GroupBy.cumcount
for new level in MultiIndex
:
print (df)
2018 01.01.2018 02.08.2018 03.08.2018
0 ROOM 01 (Morning) John Doe Jane Doe Donny Doe
1 Very Nice Project# Project# Project#
2 Good Projector Project Title Project Title Project Title
3 Telephone 1234 9:30-17.00 8-13.00 12-14.00
4 ROOM 01 (Morning) Alan Smithee Susi Smithee Donald Smithee
5 Very Nice Project# Project# Project#
6 Good Projector Project Title Project Title Project Title
7 Telephone 1234 17:30-21.00 13.15-16.00 14.15-16.00
8 ROOM 02 (Morning) Jimmy Doe Duffy Duck Benny Blanco
9 Not So Nice Project# Project# Project#
10 Whiteboard Project Title Project Title Project Title
11 Telephone 5678 9:30-17.00 8-13.00 12-14.00
12 ROOM 02 (Afternoon) Doris Day Teddy Kaczinsky Ru Paul
13 Not so Nice Project# Project# Project#
14 Whiteboard Project Title Project Title Project Title
15 Telephone 5678 17:30-21.00 13.15-16.00 14.15-16.00
first = df.columns[0]
df[first] = df[first].where(df[first].str.contains("ROOM")).ffill()
df['group'] = df.index % 4
d = {0:'name', 1:'project', 2:'project title', 3: 'time'}
df1 = (df.set_index([first, 'group'])
.rename(columns = lambda x: pd.to_datetime(x, format='%d.%m.%Y'))
.stack()
.to_frame())
g = df1.groupby(level=[0,1]).cumcount()
df1 = (df1.set_index(g, append=True)[0]
.unstack(1)
.rename(columns=d)
.swaplevel(1,0)
.sort_index()
.rename_axis(index=['date', 'room', 'tmp'], columns=None)
)
print (df1)
name project project title \
date room tmp
2018-01-01 ROOM 01 (Morning) 0 John Doe Project# Project Title
3 Alan Smithee Project# Project Title
ROOM 02 (Afternoon) 0 Doris Day Project# Project Title
ROOM 02 (Morning) 0 Jimmy Doe Project# Project Title
2018-08-02 ROOM 01 (Morning) 1 Jane Doe Project# Project Title
4 Susi Smithee Project# Project Title
ROOM 02 (Afternoon) 1 Teddy Kaczinsky Project# Project Title
ROOM 02 (Morning) 1 Duffy Duck Project# Project Title
2018-08-03 ROOM 01 (Morning) 2 Donny Doe Project# Project Title
5 Donald Smithee Project# Project Title
ROOM 02 (Afternoon) 2 Ru Paul Project# Project Title
ROOM 02 (Morning) 2 Benny Blanco Project# Project Title
time
date room tmp
2018-01-01 ROOM 01 (Morning) 0 9:30-17.00
3 17:30-21.00
ROOM 02 (Afternoon) 0 17:30-21.00
ROOM 02 (Morning) 0 9:30-17.00
2018-08-02 ROOM 01 (Morning) 1 8-13.00
4 13.15-16.00
ROOM 02 (Afternoon) 1 13.15-16.00
ROOM 02 (Morning) 1 8-13.00
2018-08-03 ROOM 01 (Morning) 2 12-14.00
5 14.15-16.00
ROOM 02 (Afternoon) 2 14.15-16.00
ROOM 02 (Morning) 2 12-14.00
Upvotes: 1