Norhther
Norhther

Reputation: 500

Reading custom sparse format in Pandas

I'm trying to read the following format and save it in a Dataframe.

0    21:1.00   42:1.00   63:1.00
0    9: .18   18: .82   32: .15   36: .62
1    8:1.00   22: .06
...

The format consists in:

The problem I'm facing is that this is a sparse representation of a dataset, and I want something like:

class attr0 attr1 attr2 ... attrn
 0      0     0     1   ...  3

A dense representation

Upvotes: 1

Views: 66

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35676

One approach is to split and expand the values into two columns and convert to int and float.

Then pivot so that the first column becomes the column headers, and the second becomes the values.

Then just cleanup the display by adding the attr_ prefix, and renaming the index column to class:

import pandas as pd

df = pd.DataFrame({0: ["21:1.00", "42:1.00", "63:1.00", None],
                   1: ["9: .18", "18: .82", "32: .15", "36: .62"],
                   2: ["8:1.00", "22: .06", None, None]})

df = df.stack() \
    .str.split(':', expand=True) \
    .astype({0: int, 1: float}) \
    .droplevel(0) \
    .reset_index() \
    .pivot(index='index', columns=0, values=1) \
    .reset_index() \
    .add_prefix('attr') \
    .fillna(0) \
    .rename(columns={'attrindex': 'class'})

print(df.to_string(index=False))

Output:

 class  attr8  attr9  attr18  attr21  attr22  attr32  attr36  attr42  attr63
     0    0.0   0.00    0.00     1.0    0.00    0.00    0.00     1.0     1.0
     1    0.0   0.18    0.82     0.0    0.00    0.15    0.62     0.0     0.0
     2    1.0   0.00    0.00     0.0    0.06    0.00    0.00     0.0     0.0

Expand into multiple columns

df = df.stack() \
    .str.split(':', expand=True) \
    .astype({0: int, 1: float})

print(df.to_string())
        0     1
0 0  21.0  1.00
  1   9.0  0.18
  2   8.0  1.00
1 0  42.0  1.00
  1  18.0  0.82
  2  22.0  0.06
2 0  63.0  1.00
  1  32.0  0.15
3 1  36.0  0.62

Pivot into desired format

df = df \
    .droplevel(0) \
    .reset_index() \
    .pivot(index='index', columns=0, values=1) \
    .reset_index()

print(df.to_string())
0  index    8     9    18   21    22    32    36   42   63
0      0  NaN   NaN   NaN  1.0   NaN   NaN   NaN  1.0  1.0
1      1  NaN  0.18  0.82  NaN   NaN  0.15  0.62  NaN  NaN
2      2  1.0   NaN   NaN  NaN  0.06   NaN   NaN  NaN  NaN

Cleanup, rename, etc.

df = df \
    .add_prefix('attr') \
    .fillna(0) \
    .rename(columns={'attrindex': 'class'})

Upvotes: 1

Related Questions