Reputation: 500
I'm trying to read the following format and save it in a Dataframe.
0 21:1.00 42:1.00 63:1.00
0 9: .18 18: .82 32: .15 36: .62
1 8:1.00 22: .06
...
The format consists in:
21:1.00
means that the attribute 21
has a value of 1.The problem I'm facing is that this is a sparse representation of a dataset, and I want something like:
class attr0 attr1 attr2 ... attrn
0 0 0 1 ... 3
A dense representation
Upvotes: 1
Views: 66
Reputation: 35676
One approach is to split and expand the values into two columns and convert to int
and float
.
Then pivot so that the first column becomes the column headers, and the second becomes the values.
Then just cleanup the display by adding the attr_ prefix, and renaming the index column to class:
import pandas as pd
df = pd.DataFrame({0: ["21:1.00", "42:1.00", "63:1.00", None],
1: ["9: .18", "18: .82", "32: .15", "36: .62"],
2: ["8:1.00", "22: .06", None, None]})
df = df.stack() \
.str.split(':', expand=True) \
.astype({0: int, 1: float}) \
.droplevel(0) \
.reset_index() \
.pivot(index='index', columns=0, values=1) \
.reset_index() \
.add_prefix('attr') \
.fillna(0) \
.rename(columns={'attrindex': 'class'})
print(df.to_string(index=False))
Output:
class attr8 attr9 attr18 attr21 attr22 attr32 attr36 attr42 attr63 0 0.0 0.00 0.00 1.0 0.00 0.00 0.00 1.0 1.0 1 0.0 0.18 0.82 0.0 0.00 0.15 0.62 0.0 0.0 2 1.0 0.00 0.00 0.0 0.06 0.00 0.00 0.0 0.0
Expand into multiple columns
df = df.stack() \
.str.split(':', expand=True) \
.astype({0: int, 1: float})
print(df.to_string())
0 1
0 0 21.0 1.00
1 9.0 0.18
2 8.0 1.00
1 0 42.0 1.00
1 18.0 0.82
2 22.0 0.06
2 0 63.0 1.00
1 32.0 0.15
3 1 36.0 0.62
Pivot into desired format
df = df \
.droplevel(0) \
.reset_index() \
.pivot(index='index', columns=0, values=1) \
.reset_index()
print(df.to_string())
0 index 8 9 18 21 22 32 36 42 63
0 0 NaN NaN NaN 1.0 NaN NaN NaN 1.0 1.0
1 1 NaN 0.18 0.82 NaN NaN 0.15 0.62 NaN NaN
2 2 1.0 NaN NaN NaN 0.06 NaN NaN NaN NaN
Cleanup, rename, etc.
df = df \
.add_prefix('attr') \
.fillna(0) \
.rename(columns={'attrindex': 'class'})
Upvotes: 1