Reputation: 585
Having scraped this data from XML and storing it to a Pandas DataFrame, I wanted to ask if, and how, to format the data according to requirements.
Problem 1: Basically every three rows need to be combined by removing the NaNs, to represent one row. For example, the correct way to represent this data would be:
qid, qualid, val
0 2065887850, 227, (0,0,0,0,0,0,0,0,0,0)
1 2021905255, 30, (49214,.....,...,....)
Problem 2:
In some places, val variable is actually missing (every third row in original data) and I need to place a NaN for that. The desired output in this case would be. Ideally, I need a solution which solves both problems.
qid, qualid, val
0 2065887850, 227, (0,0,0,0,0,0,0,0,0,0)
1 2121905255, 23, (49214,.....,...,....)
2 2834347850, 17, NaN
3 2121905255, 930, (14124,.....,...,....)
4 2465212110, 227, (2,2,1,4,6,7,0,0,0,0)
5 2921905225, 130, NaN
Upvotes: 1
Views: 40
Reputation: 4607
You can try of converting dataframe to string type and groupby 3 rows and add the rows,
col_1 col_2 col_3
0 29021214 NaN NaN
1 NaN 227 NaN
2 NaN NaN 0,0,0,0
4 20218343 NaN NaN
5 NaN 30 NaN
7 324234 NaN NaN
8 NaN 532 NaN
9 NaN NaN NaN
temp = df.col_1.notnull().cumsum()
df = df.replace(np.nan,'').astype(str)
df.groupby(temp).sum().replace('',np.nan)
Out:
col_1 col_2 col_3
col_1
1 29021214 227 0,0,0,0
2 20218343 30 NaN
3 324234 532.0 NaN
Upvotes: 1