Abhinav Ralhan
Abhinav Ralhan

Reputation: 585

Issue with formatting the dataset

Having scraped this data from XML and storing it to a Pandas DataFrame, I wanted to ask if, and how, to format the data according to requirements.

Dataset Image (Updated)

Problem 1: Basically every three rows need to be combined by removing the NaNs, to represent one row. For example, the correct way to represent this data would be:

  qid, qualid, val

0 2065887850, 227, (0,0,0,0,0,0,0,0,0,0)
1 2021905255,  30, (49214,.....,...,....) 

Problem 2:

In some places, val variable is actually missing (every third row in original data) and I need to place a NaN for that. The desired output in this case would be. Ideally, I need a solution which solves both problems.

  qid, qualid, val

0 2065887850, 227, (0,0,0,0,0,0,0,0,0,0)
1 2121905255,  23, (49214,.....,...,....) 
2 2834347850,  17, NaN
3 2121905255, 930, (14124,.....,...,....) 
4 2465212110, 227, (2,2,1,4,6,7,0,0,0,0)
5 2921905225, 130, NaN

Upvotes: 1

Views: 40

Answers (1)

Naga kiran
Naga kiran

Reputation: 4607

You can try of converting dataframe to string type and groupby 3 rows and add the rows,

      col_1   col_2     col_3
0   29021214    NaN     NaN
1   NaN         227     NaN
2   NaN         NaN     0,0,0,0
4   20218343    NaN     NaN
5   NaN         30      NaN
7   324234      NaN     NaN
8   NaN         532     NaN
9   NaN         NaN     NaN


temp = df.col_1.notnull().cumsum()
df = df.replace(np.nan,'').astype(str)
df.groupby(temp).sum().replace('',np.nan)

Out:

    col_1       col_2   col_3
col_1           
1   29021214    227    0,0,0,0
2   20218343    30     NaN
3   324234      532.0   NaN

Upvotes: 1

Related Questions