Zusman
Zusman

Reputation: 666

dataframe structure manipulation

I have a very large csv file with more than 100k records.
In it, each record is one object with id and about 20~30 properties.
I need to manipulate it so each record is a triplet of id, none empty property, and a value.
I created a sample of a simple data frame to give an example.
Given the following data frame:

data = [{'id': 1, 'shape': 'circle', 'size': 10, 'color':'green'},
        {'id': 2, 'shape': 'square', 'color':'pink'},
        {'id': 3, 'shape': 'triangle', 'size': 5, 'color': 'black'},
        {'id': 4, 'shape': 'pentagon', 'size': 25}]
df = pd.DataFrame(data)
df
Out[10]: 
   color  id     shape  size
0  green   1    circle  10.0
1   pink   2    square   NaN
2  black   3  triangle   5.0
3    NaN   4  pentagon  25.0

Is there an efficient way to get a result? It should look like this:

Out[17]: 
   id property     value
0   1   shape    circle
1   1    size        10
2   1   color     green
3   2   shape    square
4   2   color      pink
5   3   shape  triangle
6   3    size         5
7   3   color     black
8   4   shape  pentagon
9   4    size        25



of course, NaN cells should be skipped.

Upvotes: 4

Views: 65

Answers (2)

Karn Kumar
Karn Kumar

Reputation: 8816

You can simply try as @Jon mentioned df.melt() calling the melt() on the dataframe with the id_vars parameter set & sort the values.

>>> df.melt(id_vars='id', var_name='property').dropna().sort_values('id')
    id property     value
0    1    color     green
4    1    shape    circle
8    1     size        10
1    2    color      pink
5    2    shape    square
2    3    color     black
6    3    shape  triangle
10   3     size         5
7    4    shape  pentagon
11   4     size        25

Read the article it's interesting where it says..

  • Performance is ~30-40% faster than original melt, slightly slower than lreshape and much faster than wide_to_long.

Upvotes: 2

anky
anky

Reputation: 75080

Using pd.melt()

pd.melt(df,id_vars='id',var_name='Property').dropna().sort_values('id')
    id Property     value
0    1    color     green
4    1    shape    circle
8    1     size        10
1    2    color      pink
5    2    shape    square
2    3    color     black
6    3    shape  triangle
10   3     size         5
7    4    shape  pentagon
11   4     size        25

If order of index is important:

>>pd.melt(df,id_vars='id',var_name='Property').dropna().sort_values('id').reset_index(drop=True)

   id Property     value
0   1    color     green
1   1    shape    circle
2   1     size        10
3   2    color      pink
4   2    shape    square
5   3    color     black
6   3    shape  triangle
7   3     size         5
8   4    shape  pentagon
9   4     size        25

Upvotes: 4

Related Questions