Reputation: 75
I have a csv with product data like:
Item,Val1,Val2,Val3,Val4,Val5
SomeProductName1,,,,,
SomeProductDetails1,,,,,
ProductGroupHeader1,,,,,
ProductInfo1,39,8,6,94,112
ProductInfo2,32,7,4,94,112
ProductGroupHeader2,,,,,
ProductInfo3,39,8,6,94,112
ProductInfo4,32,7,4,94,112
SomeProductName2,,,,,
SomeProductDetails2,,,,,
ProductGroupHeader21,,,,,
ProductInfo21,39,8,6,94,112
ProductInfo22,32,7,4,94,112
ProductGroupHeader2,,,,,
ProductInfo23,39,8,6,94,112
ProductInfo24,32,7,4,94,112
I need it as:
Item,Val1,Val2,Val3,Val4,Val5
SomeProductName1, SomeProductDetails1, ProductGroupHeader1,,,,,
SomeProductName1, SomeProductDetails1, ProductInfo1,39,8,6,94,112
SomeProductName1, SomeProductDetails1, ProductInfo2,32,7,4,94,112
SomeProductName1, SomeProductDetails1, ProductGroupHeader2,,,,,
SomeProductName1, SomeProductDetails1, ProductInfo3,39,8,6,94,112
SomeProductName1, SomeProductDetails1, ProductInfo4,32,7,4,94,112
SomeProductName2, SomeProductDetails2, ProductGroupHeader21,,,,,
SomeProductName2, SomeProductDetails2, ProductInfo21,39,8,6,94,112
SomeProductName2, SomeProductDetails2, ProductInfo22,32,7,4,94,112
SomeProductName2, SomeProductDetails2, ProductGroupHeader2,,,,,
SomeProductName2, SomeProductDetails2, ProductInfo23,39,8,6,94,112
SomeProductName2, SomeProductDetails2, ProductInfo24,32,7,4,94,112
In essence, I want to get the SomeProductName
and SomeProductDetails
from their respective rows, remove those rows and then add the values as 2 columns in the ProductInfo
rows
The csv has several thousand rows and my initial thought would be to just loop through updating and deleting rows according.
I then intend to pivot the data based on the ProductName
and possibly plus ProductDetails
I'm new to pandas and python and just wondered if there was an easier/more efficient way?
Upvotes: 2
Views: 67
Reputation: 29635
To fit your expected output you can do it with mask where all values are nan with filter
and isna
. Assuming the structure is rigorous, you can find Name and Details rows using shift
. Then concat
the Name and Detail columns, created with where
and ffill
, to df and select only the rows you want.
#get the rows with nan in all values columns
m = df.filter(like='Val').isna().all(1)
# get the rows with ProductName, it is where
# all val are nan and also where all val are nan two rows later (GroupHeader rows)
name = m&m.shift(-2)
# get the rows with ProductDetails, it is where
# all val are nan the row before (ProductName rows)
# and also all val are nan one row later (GroupHeader rows)
details = m & m.shift(-1) & m.shift(1)
# you can create the dataframe wth concat,
# use where to and ffill to keep name and details on followinf rows
df_ = (pd.concat([df['Item'].where(name).ffill().rename('Item_name'),
df['Item'].where(details).ffill().rename('Item_details'),
df],
axis=1)
[~(name|details)] #remove rows with only name and details
)
and you get
print (df_)
Item_name Item_product Item Val1 Val2 \
2 SomeProductName1 SomeProductDetails1 ProductGroupHeader1 NaN NaN
3 SomeProductName1 SomeProductDetails1 ProductInfo1 39.0 8.0
4 SomeProductName1 SomeProductDetails1 ProductInfo2 32.0 7.0
5 SomeProductName1 SomeProductDetails1 ProductGroupHeader2 NaN NaN
6 SomeProductName1 SomeProductDetails1 ProductInfo3 39.0 8.0
7 SomeProductName1 SomeProductDetails1 ProductInfo4 32.0 7.0
10 SomeProductName2 SomeProductDetails2 ProductGroupHeader21 NaN NaN
11 SomeProductName2 SomeProductDetails2 ProductInfo21 39.0 8.0
12 SomeProductName2 SomeProductDetails2 ProductInfo22 32.0 7.0
13 SomeProductName2 SomeProductDetails2 ProductGroupHeader2 NaN NaN
14 SomeProductName2 SomeProductDetails2 ProductInfo23 39.0 8.0
15 SomeProductName2 SomeProductDetails2 ProductInfo24 32.0 7.0
Val3 Val4 Val5
2 NaN NaN NaN
3 6.0 94.0 112.0
4 4.0 94.0 112.0
5 NaN NaN NaN
6 6.0 94.0 112.0
7 4.0 94.0 112.0
10 NaN NaN NaN
11 6.0 94.0 112.0
12 4.0 94.0 112.0
13 NaN NaN NaN
14 6.0 94.0 112.0
15 4.0 94.0 112.0
Edit, to add the groupheader as a column, you can create a similar mask and then use it the same way in the concat:
#rows where all values are nan but not next row
groupHeader = m&(~m).shift(-1)
df_ = (pd.concat([df['Item'].where(name).ffill().rename('Item_name'),
df['Item'].where(details).ffill().rename('Item_details'),
df['Item'].where(groupHeader).ffill().rename('Item_group'), #add this
df],
axis=1)
[~(name|details|groupHeader)] #remove also the rows with groupHeader only
)
Upvotes: 1