Reputation: 354
I'm trying to convert a very wide CSV file with r rows and c columns into a dict or dataframe with r*c rows and three columns of the form row_id, col_name, col_value. Since the number of columns is very large (more than 10,000), the columns can't be listed by hand.
Say for example I start with a pandas dataframe:
import pandas as pd
df = pd.DataFrame({'id': {0: '1', 1: '2', 2: '3'},
                   'c1': {0: 'S', 1: 'S', 2: 'D'},
                   'c2': {0: 'XX', 1: 'WX', 2: 'WX'},
                   'c3': {0: '32', 1: '63', 2: '32'}})
df = df.set_index('id')
that looks like this:
    c1  c2  c3
id
1    S  XX  32
2    S  WX  63
3    D  WX  32
Keep in mind that this example dataframe has only three columns, but the solution needs to work through a very large number of columns.
The objective is to convert this to a dict or dataframe that looks like this:
  id key value
0  1  c1     S
1  1  c2    XX
2  1  c3    32
3  2  c1     S
4  2  c2    WX
5  2  c3    63
6  3  c1     D
7  3  c2    WX
8  3  c3    32
I have written something that produces the desired output by iterating over every row and column of the dataframe and collecting the values into a new dataframe:
data = []
for i, row in df.iterrows():
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for j, value in row.items():
        # one (id, key, value) tuple per cell
        data.append((i, j, value))
df_out = pd.DataFrame(data, columns=['id', 'key', 'value'])
But I've read one can and should avoid using for loops in pandas and Python. So what would a proper solution look like?
Upvotes: 2
Views: 1023
Reputation: 34086
You can do this:
In [212]: df.stack(dropna=False)\
     ...:   .reset_index(name='Value')\
     ...:   .rename(columns={'level_1': 'key'})
Out[212]:
  id key Value
0  1  c1     S
1  1  c2    XX
2  1  c3    32
3  2  c1     S
4  2  c2    WX
5  2  c3    63
6  3  c1     D
7  3  c2    WX
8  3  c3    32
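A variation on the same idea, in case it helps: if the column axis is named before stacking, the stacked level comes out already called key, so the rename of 'level_1' is not needed. This is only a sketch against the example data from the question. The names long_df and 'value' are my own choices, and how stack() treats missing values by default differs between pandas versions, so check that on real data.

```python
import pandas as pd

# Example frame from the question, indexed by 'id'
df = pd.DataFrame({'id': ['1', '2', '3'],
                   'c1': ['S', 'S', 'D'],
                   'c2': ['XX', 'WX', 'WX'],
                   'c3': ['32', '63', '32']}).set_index('id')

# Name the column axis 'key' so the stacked level carries that name,
# then turn the (id, key) MultiIndex back into ordinary columns.
# Note: default NaN handling in stack() depends on the pandas version.
long_df = (df.rename_axis(columns='key')
             .stack()
             .reset_index(name='value'))

# The question also mentions a dict; a list of row dicts is one option.
records = long_df.to_dict('records')
```

Nothing here depends on the number of columns, so it should behave the same with 10,000 columns as with three.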
Upvotes: 3
Reputation: 13447
Have you considered using pd.melt?
import pandas as pd
df = pd.DataFrame({'id': {0: '1', 1: '2', 2: '3'},
                   'c1': {0: 'S', 1: 'S', 2: 'D'},
                   'c2': {0: 'XX', 1: 'WX', 2: 'WX'},
                   'c3': {0: '32', 1: '63', 2: '32'}})
out = pd.melt(df,
              id_vars=['id'],
              value_vars=df.columns[1:])
  id variable value
0  1       c1     S
1  2       c1     S
2  3       c1     D
3  1       c2    XX
4  2       c2    WX
5  3       c2    WX
6  1       c3    32
7  2       c3    63
8  3       c3    32
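Note that melt returns the rows grouped by column rather than by id, and names the middle column variable. If the exact layout from the question is needed, something like the following sketch should get there. The var_name/value_name arguments and the stable sort are additions of mine, and the sort compares the ids as strings here, so ids such as '10' would sort before '2' unless they are converted to numbers first.

```python
import pandas as pd

# Example frame from the question, with 'id' kept as a regular column
df = pd.DataFrame({'id': ['1', '2', '3'],
                   'c1': ['S', 'S', 'D'],
                   'c2': ['XX', 'WX', 'WX'],
                   'c3': ['32', '63', '32']})

# With value_vars omitted, melt unpivots every column not in id_vars.
# A stable sort on 'id' groups rows by id and keeps the original
# column order (c1, c2, c3) within each id.
out = (df.melt(id_vars='id', var_name='key', value_name='value')
         .sort_values('id', kind='stable')
         .reset_index(drop=True))
```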
Upvotes: 2