Reputation: 1998
I am working with a dataframe that I created from a CSV file. The data has header rows interspersed throughout it, each identifying something about the rows below it, until the next header row.
The data looks something like this.
2001| |colour |Price | Quantity sold
Shoes|
Blank | High heel Shoes| red |£22|44
Blank | Low heel Shoes| red |£22|44
Slippers|
Blank | High heel Slippers| red |£22|44
Blank | High heel Slippers| blue |£22|44
Blank | Low heel Slippers| red |£22|44
2002| |colour |Price | Quantity sold
Shoes|
Blank | High heel Shoes| red |£22|44
Blank | Low heel Shoes| red |£22|44
Slippers|
Blank | High heel Slippers| red |£22|44
Blank | High heel Slippers| blue |£22|44
Blank | Low heel Slippers| red |£22|44
What type of structure is this?
I need to read through this dataframe and get all the data on a particular item (say Slippers) for each year from the header rows (so 2001, 2002 and so on). Even adding a column with the corresponding year next to each data row would help.
I would appreciate some help on how to do that.
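In other words, the CSV is a single flat table with group-header rows interleaved among the data rows. Just to make the shape concrete, here is a minimal sketch (made-up values, assuming the first year line is consumed as the CSV header, so the first column is literally named 2001 and the second one is unnamed, and the blank cells come through as NaN):
import pandas as pd

#hypothetical reconstruction of the shape described above - not the real file
data = [
    ['Shoes',    None,                 None,     None,    None],
    [None,       'High heel Shoes',    'red',    22,      44],
    [None,       'Low heel Shoes',     'red',    22,      44],
    ['Slippers', None,                 None,     None,    None],
    [None,       'High heel Slippers', 'red',    22,      44],
    ['2002',     None,                 'colour', 'Price', 'Quantity sold'],
    ['Shoes',    None,                 None,     None,    None],
    [None,       'High heel Shoes',    'red',    22,      44],
]
df = pd.DataFrame(data, columns=['2001', 'Unnamed: 1', 'colour', 'Price', 'Quantity sold'])
print(df)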
Upvotes: 3
Views: 150
Reputation: 862511
Use:
import pandas as pd

df = pd.read_csv('test.csv')
#get the name of the first column (here it is the first year, 2001)
col = df.columns[0]
#forward fill the last previous value in the first column
df[col] = df[col].ffill()
#convert the first column to numeric - year header rows become numbers, everything else NaN
num = pd.to_numeric(df[col], errors='coerce')
#forward fill the years, then replace the first group's NaN with the first column name (the first year)
df['Year'] = num.ffill().fillna(col)
#rename the columns
df = df.rename(columns={col:'Shoes', 'Unnamed: 1':'Names'})
#remove the year header rows and the category-only rows
df = df[num.isnull() & df['colour'].notnull()].reset_index(drop=True)
print (df)
Shoes Names colour price Quantity sold Year
0 Type A shoes Sub type A red 22 5 2001
1 Type A shoes Sub type A green 11 5 2001
2 Type A shoes Sub type A yellow 44 5 2001
3 Type A shoes Sub type B red 33 5 2001
4 Type A shoes Sub type B green 66 5 2001
5 Type A shoes Sub type B yellow 22 5 2001
6 Type B shoes Sub type A red 11 5 2001
7 Type B shoes Sub type A green 44 5 2001
8 Type B shoes Sub type A yellow 33 5 2001
9 Type B shoes Sub type B red 66 5 2001
10 Type B shoes Sub type B green 21 5 2001
11 Type B shoes Sub type B yellow 22 5 2001
12 Type A shoes Sub type A red 22 5 2002
13 Type A shoes Sub type A green 11 5 2002
14 Type A shoes Sub type A yellow 44 5 2002
15 Type A shoes Sub type B red 33 5 2002
16 Type A shoes Sub type B green 66 5 2002
17 Type A shoes Sub type B yellow 22 5 2002
18 Type B shoes Sub type A red 11 5 2002
19 Type B shoes Sub type A green 44 5 2002
20 Type B shoes Sub type A yellow 33 5 2002
21 Type B shoes Sub type B red 66 5 2002
22 Type B shoes Sub type B green 21 5 2002
23 Type B shoes Sub type B yellow 22 5 2002
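With the Year column in place, getting everything for one item per year is just a filter plus a groupby. A small sketch, assuming the item of interest sits in the Shoes column of the result above and using 'Type B shoes' as a placeholder value:
#hypothetical follow-up: all rows for one item, split by year
item = df[df['Shoes'] == 'Type B shoes']
for year, grp in item.groupby('Year'):
    print(year)
    print(grp[['Names', 'colour', 'price', 'Quantity sold']])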
EDIT:
import pandas as pd

df = pd.read_csv('testV2.csv', sep='\t')
#get the name of the first column (here it is the first year, 2001)
col = df.columns[0]
#forward fill the last previous value in the first column
df[col] = df[col].ffill()
#convert the first column to numeric - year header rows become numbers, everything else NaN
num = pd.to_numeric(df[col], errors='coerce')
#forward fill the years, then replace the first group's NaN with the first column name (the first year)
df['Year'] = num.ffill().fillna(col)
#rename the columns
df = df.rename(columns={col:'Top Category', 'Unnamed: 1':'Names'})
#remove the year header rows and the rows coming from the repeated header lines
df = df[num.isnull() & (df['Top Category'] != 'Top Category')].reset_index(drop=True)
print (df)
Top Category Names Colour Price Sold Year
0 Item 1 Type 1 - 2 NaN 2001
1 Item 2 Type 1 - 2 NaN 2001
2 Item 3 Type 1 red 2 5 2001
3 Item 3 Type 2 blue 2 5 2001
4 Item 3 Type 3 green 2 5 2001
5 item 4 Type 1 red 2 5 2001
6 item 4 Type 2 blue 3 NaN 2001
7 item 4 Type 3 green 3 NaN 2001
8 Item 1 Type 1 - 3 NaN 2002
9 Item 2 Type 1 - 3 NaN 2002
10 Item 3 Type 1 red 3 5 2002
11 Item 3 Type 2 blue 3 5 2002
12 Item 3 Type 3 green 3 5 2002
13 Item4 Type 1 red 3 NaN 2002
14 Item4 Type 2 blue 3 NaN 2002
15 Item4 Type 3 green 3 NaN 2002
16 Item 1 Type 1 - 3 NaN 2003
17 Item 2 Type 1 - 3 NaN 2003
18 Item 3 Type 1 red 3 5 2003
19 Item 3 Type 2 blue 3 5 2003
20 Item 3 Type 3 green 3 5 2003
21 Item4 Type 1 red 3 NaN 2003
22 Item4 Type 2 blue 3 NaN 2003
23 Item4 Type 3 green 3 NaN 2003
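The same follow-up works for this layout as well; a sketch (assuming the column names shown above, and using 'Item 3' as a placeholder item) that pulls one item and totals its Sold figures per year:
#hypothetical follow-up: one item with a per-year total of the Sold column
#Sold may be read as text because of the repeated header lines, hence to_numeric
item = df[df['Top Category'] == 'Item 3'].copy()
item['Sold'] = pd.to_numeric(item['Sold'], errors='coerce')
print(item)
print(item.groupby('Year')['Sold'].sum())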
Upvotes: 2