Reputation: 103
I have the following data format in a file:
ID,var_name,var_value
1,ABC,This is abc1
1,DEF,This is def1
2,ABC,This is abc2
2,DEF,This is def2
2,GHI,This is ghi2
3,ABC,This is abc3
4,ABC,This is abc4
4,DEF,This is def4
I also have a header list: ['ABC','GHI']
In this data set an ID will not necessarily have every variable; ID 2 has the maximum number of variables (ABC, DEF, GHI). I need to convert the data set to the following nested list format:
[['ID','ABC','GHI'], [1,'This is abc1', ''],[2, 'This is abc2','This is ghi2'],[3,'This is abc3',''],[4,'This is abc4','']]
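That means the list should contain the ID plus only the variables named in the header list, with an empty string wherever an ID has no value for a variable.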
I want to do this in Python 2.7, possibly using Pandas.
Upvotes: 1
Views: 1195
Reputation: 2553
I think you should try to stay in a pandas DataFrame for this:
df2=(df.pivot(index='ID', columns='var_name', values='var_value')
.fillna('').drop('DEF', axis=1).reset_index())
#output:
var_name  ID           ABC           GHI
0          1  This is abc1
1          2  This is abc2  This is ghi2
2          3  This is abc3
3          4  This is abc4
But you can also go one step further to get the list:
print([df2.columns.tolist()] + df2.values.tolist())
[['ID', 'ABC', 'GHI'],
[1, 'This is abc1', ''],
[2, 'This is abc2', 'This is ghi2'],
[3, 'This is abc3', ''],
[4, 'This is abc4', '']]
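The snippets in this thread assume that `df` already holds the contents of the file. As a minimal sketch of the whole round trip, the example below reads the sample data from the question out of an in-memory string (with a real file you would pass its path to `pd.read_csv` instead) and then applies the pivot shown above. It is written for a current Python 3 and pandas; the names `raw` and `result` are only illustrative.

```python
import io

import pandas as pd

# Sample data copied from the question, held in a string for illustration.
raw = u"""ID,var_name,var_value
1,ABC,This is abc1
1,DEF,This is def1
2,ABC,This is abc2
2,DEF,This is def2
2,GHI,This is ghi2
3,ABC,This is abc3
4,ABC,This is abc4
4,DEF,This is def4
"""
df = pd.read_csv(io.StringIO(raw))

# Long to wide: one row per ID, one column per var_name.
df2 = (df.pivot(index='ID', columns='var_name', values='var_value')
         .fillna('')            # missing combinations become empty strings
         .drop('DEF', axis=1)   # DEF is not in the header list
         .reset_index())        # turn the ID index back into a column

# Header row followed by the data rows.
result = [df2.columns.tolist()] + df2.values.tolist()
print(result)
```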
Upvotes: 1
Reputation: 862406
Use:
L = ['ABC','GHI']
df1 = df.pivot(index='ID', columns='var_name', values='var_value').fillna('')[L].reset_index()
print (df1)
var_name  ID           ABC           GHI
0          1  This is abc1
1          2  This is abc2  This is ghi2
2          3  This is abc3
3          4  This is abc4
L1 = [df1.columns.tolist()] + df1.values.tolist()
print (L1)
[['ID', 'ABC', 'GHI'],
[1, 'This is abc1', ''],
[2, 'This is abc2', 'This is ghi2'],
[3, 'This is abc3', ''],
[4, 'This is abc4', '']]
Explanation:
Reshape with pivot, replace the NaNs with fillna, select the wanted columns with the list L, and turn the index into a column with reset_index.
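To make each step of that chain visible, here is the same pipeline split into intermediate variables. This is only a sketch: the frame is rebuilt inline from the sample data in the question, and the names `wide`, `filled` and `subset` are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data from the question, built inline so the example is self-contained.
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3, 4, 4],
    'var_name': ['ABC', 'DEF', 'ABC', 'DEF', 'GHI', 'ABC', 'ABC', 'DEF'],
    'var_value': ['This is abc1', 'This is def1', 'This is abc2',
                  'This is def2', 'This is ghi2', 'This is abc3',
                  'This is abc4', 'This is def4'],
})
L = ['ABC', 'GHI']

# 1. pivot: one row per ID, one column per var_name, NaN where a pair is missing
wide = df.pivot(index='ID', columns='var_name', values='var_value')

# 2. fillna: replace the NaNs with empty strings
filled = wide.fillna('')

# 3. column selection: keep only the columns named in L, in that order
subset = filled[L]

# 4. reset_index: move ID from the index into an ordinary column
df1 = subset.reset_index()

# Header row followed by the data rows
L1 = [df1.columns.tolist()] + df1.values.tolist()
print(L1)
```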
EDIT:
If I change the order of the values in the list, the column order follows it:
L = ['GHI', 'ABC']
df1 = df.pivot(index='ID', columns='var_name', values='var_value').fillna('')[L].reset_index()
print (df1)
var_name  ID           GHI           ABC
0          1                This is abc1
1          2  This is ghi2  This is abc2
2          3                This is abc3
3          4                This is abc4
L1 = [df1.columns.tolist()] + df1.values.tolist()
print (L1)
[['ID', 'GHI', 'ABC'],
[1, '', 'This is abc1'],
[2, 'This is ghi2', 'This is abc2'],
[3, '', 'This is abc3'],
[4, '', 'This is abc4']]
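One caveat with selecting by `[L]`: in recent pandas versions it raises a `KeyError` if the list names a variable that never occurs in the data. If that can happen, `reindex` is a possible alternative, since it creates the missing column instead of failing. The sketch below assumes a hypothetical variable `XYZ` that is absent from the sample data.

```python
import pandas as pd

# Sample data from the question, built inline so the example is self-contained.
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3, 4, 4],
    'var_name': ['ABC', 'DEF', 'ABC', 'DEF', 'GHI', 'ABC', 'ABC', 'DEF'],
    'var_value': ['This is abc1', 'This is def1', 'This is abc2',
                  'This is def2', 'This is ghi2', 'This is abc3',
                  'This is abc4', 'This is def4'],
})

# 'XYZ' is a hypothetical header that does not appear in var_name.
L = ['ABC', 'XYZ']

wide = df.pivot(index='ID', columns='var_name', values='var_value').fillna('')

# reindex keeps the columns named in L and adds any missing one,
# filled with empty strings, rather than raising KeyError.
df1 = wide.reindex(columns=L, fill_value='').reset_index()

L1 = [df1.columns.tolist()] + df1.values.tolist()
print(L1)
```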
Upvotes: 1
Reputation: 30258
Alternatively, you can just set a MultiIndex and unstack:
In []:
L = ['ABC', 'GHI']
df = df.set_index(['ID', 'var_name'])['var_value'].unstack(fill_value='')[L].reset_index()
df
Out[]:
var_name  ID           ABC           GHI
0          1  This is abc1
1          2  This is abc2  This is ghi2
2          3  This is abc3
3          4  This is abc4
In []:
[df.columns.tolist()] + df.values.tolist()
Out[]:
[['ID', 'ABC', 'GHI'],
[1, 'This is abc1', ''],
[2, 'This is abc2', 'This is ghi2'],
[3, 'This is abc3', ''],
[4, 'This is abc4', '']]
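For comparison, since pandas was only a "possibly" in the question, the same reshaping can be sketched with the standard library alone. This version is written for Python 3 (the `csv` module handles text differently on 2.7), reads the sample data from an in-memory string, and uses illustrative names (`raw`, `header`, `rows`, `result`).

```python
import csv
import io
from collections import OrderedDict

# Sample data copied from the question, held in a string for illustration.
raw = """ID,var_name,var_value
1,ABC,This is abc1
1,DEF,This is def1
2,ABC,This is abc2
2,DEF,This is def2
2,GHI,This is ghi2
3,ABC,This is abc3
4,ABC,This is abc4
4,DEF,This is def4
"""
header = ['ABC', 'GHI']

# Map each ID to a {var_name: var_value} dict, keeping first-seen ID order.
rows = OrderedDict()
for rec in csv.DictReader(io.StringIO(raw)):
    rows.setdefault(int(rec['ID']), {})[rec['var_name']] = rec['var_value']

# Header row, then one row per ID with '' for any variable it lacks.
result = [['ID'] + header]
for id_, values in rows.items():
    result.append([id_] + [values.get(name, '') for name in header])

print(result)
```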
Upvotes: 1