Reputation: 377
WHAT I HAVE:
import pandas as pd
inp = [{'long string':'ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'},
{'long string':'hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'},
{'long string':'ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2)'}]
df = pd.DataFrame(inp)
df
GIVES
long string
0 ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho...
1 hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho...
2 ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha...
WHAT I WANT
inp = {'ha-tra':['1', '1', '1'], 'ha-la':['2', '2', '2'], 'hi-tra':['1', '1', '1'], 'hi-la':['2', '2', '2'],'ho-tra':['1', '1', '1'], 'ho-la':['2', '2', '2']}
df = pd.DataFrame(inp)
df
GIVES
ha-tra ha-la hi-tra hi-la ho-tra ho-la
0 1 2 1 2 1 2
1 1 2 1 2 1 2
2 1 2 1 2 1 2
CONTEXT
From a large string, I want to get each combination of (ha hi ho) and (tra la), and get the scores related to those combinations from the string. The problem is that the order of (ha hi ho) is not similar.
Upvotes: 2
Views: 96
Reputation: 14949
One way to solve:
df1 = (
df['long string'].str.extractall(
r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
.droplevel("match")
.set_index(0, append=True)
)
d1 = df1.iloc[:, :2]
d2 = df1.iloc[:, 2:]
d2.columns = d1.columns
df2 = pd.concat([d1, d2]).reset_index()
df2 = df2.pivot(index='level_0', columns=[0, 1], values=2)
df2.columns = df2.columns.map('-'.join)
df2 = df2.reset_index(drop=True)
ALTERNATIVE:
df2 = (
(
df['long string'].str.extractall(
r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
.droplevel("match")
.set_index(0, append=True)
.apply(lambda x: x.values.reshape(-1, 2), axis=1)
.explode()
.apply(pd.Series)
.add_prefix('val')
.reset_index()
).pivot(index=['level_0'], columns=[0, 'val0'], values='val1')
).reset_index(drop=True)
df2.columns = df2.columns.map('-'.join)
OUTPUT:
ha_la ha_tra hi_la hi_tra ho_la ho_tra
0 2 1 2 1 2 1
1 2 1 2 1 2 1
2 2 1 2 1 2 1
Upvotes: 2
Reputation: 18315
ndf = (df["long string"]
.str.extractall(r"(ha|hi|ho):\s\((?:tra|la):\s(\d+)\s(?:tra|la):\s(\d+)\)")
.droplevel("match")
.set_index(0, append=True)
.set_axis(["tra", "la"], axis=1)
.unstack()
.swaplevel(axis=1))
ndf.columns = ndf.columns.map("-".join)
extractall
called match
ha-hi-ho
matches as the index (0
is first capturing group)tra
and la
ha-hi-ho
index to the columnsha-hi-ho
and tra-la
levels' order in columns so that ha-hi-ho
is upperto get
ha-tra hi-tra ho-tra ha-la hi-la ho-la
0 1 1 1 2 2 2
1 1 1 1 2 2 2
2 1 1 1 2 2 2
Upvotes: 3