Reading custom sparse format in Pandas

Question

I'm trying to read the following format and save it in a Dataframe.

0    21:1.00   42:1.00   63:1.00
0    9: .18   18: .82   32: .15   36: .62
1    8:1.00   22: .06
...

The format consists in:

The first integer is the class of the sample (this is a classification dataset)
The following are the attributes. So for instance, in the first row, 21:1.00 means that the attribute 21 has a value of 1.

The problem I'm facing is that this is a sparse representation of a dataset, and I want something like:

class attr0 attr1 attr2 ... attrn
 0      0     0     1   ...  3

A dense representation

Henry Ecker · Accepted Answer

One approach is to split and expand the values into two columns and convert to int and float.

Then pivot so that the first column becomes the column headers, and the second becomes the values.

Then just cleanup the display by adding the attr_ prefix, and renaming the index column to class:

import pandas as pd

df = pd.DataFrame({0: ["21:1.00", "42:1.00", "63:1.00", None],
                   1: ["9: .18", "18: .82", "32: .15", "36: .62"],
                   2: ["8:1.00", "22: .06", None, None]})

df = df.stack() \
    .str.split(':', expand=True) \
    .astype({0: int, 1: float}) \
    .droplevel(0) \
    .reset_index() \
    .pivot(index='index', columns=0, values=1) \
    .reset_index() \
    .add_prefix('attr') \
    .fillna(0) \
    .rename(columns={'attrindex': 'class'})

print(df.to_string(index=False))

Output:

 class  attr8  attr9  attr18  attr21  attr22  attr32  attr36  attr42  attr63
     0    0.0   0.00    0.00     1.0    0.00    0.00    0.00     1.0     1.0
     1    0.0   0.18    0.82     0.0    0.00    0.15    0.62     0.0     0.0
     2    1.0   0.00    0.00     0.0    0.06    0.00    0.00     0.0     0.0

Expand into multiple columns

df = df.stack() \
    .str.split(':', expand=True) \
    .astype({0: int, 1: float})

print(df.to_string())

        0     1
0 0  21.0  1.00
  1   9.0  0.18
  2   8.0  1.00
1 0  42.0  1.00
  1  18.0  0.82
  2  22.0  0.06
2 0  63.0  1.00
  1  32.0  0.15
3 1  36.0  0.62

Pivot into desired format

df = df \
    .droplevel(0) \
    .reset_index() \
    .pivot(index='index', columns=0, values=1) \
    .reset_index()

print(df.to_string())

0  index    8     9    18   21    22    32    36   42   63
0      0  NaN   NaN   NaN  1.0   NaN   NaN   NaN  1.0  1.0
1      1  NaN  0.18  0.82  NaN   NaN  0.15  0.62  NaN  NaN
2      2  1.0   NaN   NaN  NaN  0.06   NaN   NaN  NaN  NaN

Cleanup, rename, etc.

df = df \
    .add_prefix('attr') \
    .fillna(0) \
    .rename(columns={'attrindex': 'class'})

Reading custom sparse format in Pandas

Answers (1)

Related Questions