Charles
Charles

Reputation: 377

How to create columns from a string in a dataframe?

WHAT I HAVE:

import pandas as pd
inp = [{'long string':'ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'}, 
{'long string':'hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho: (tra: 1 la: 2)'}, 
{'long string':'ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2)'}]
df = pd.DataFrame(inp)
df

GIVES

    long string
0   ha: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ho...
1   hi: (tra: 1 la: 2) \n ha: (tra: 1 la: 2) \n ho...
2   ho: (tra: 1 la: 2) \n hi: (tra: 1 la: 2) \n ha...

WHAT I WANT

inp = {'ha-tra':['1', '1', '1'], 'ha-la':['2', '2', '2'], 'hi-tra':['1', '1', '1'], 'hi-la':['2', '2', '2'],'ho-tra':['1', '1', '1'], 'ho-la':['2', '2', '2']}
df = pd.DataFrame(inp)
df

GIVES

    ha-tra  ha-la   hi-tra  hi-la   ho-tra  ho-la
0   1       2       1       2       1       2
1   1       2       1       2       1       2
2   1       2       1       2       1       2

CONTEXT

From a large string, I want to get each combination of (ha hi ho) and (tra la), and get the scores related to those combinations from the string. The problem is that the order of (ha hi ho) is not similar.

Upvotes: 2

Views: 96

Answers (2)

Nk03
Nk03

Reputation: 14949

One way to solve:

df1 = (
    df['long string'].str.extractall(
        r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
    .droplevel("match")
    .set_index(0, append=True)
)

d1 = df1.iloc[:, :2]
d2 = df1.iloc[:, 2:]
d2.columns = d1.columns

df2 = pd.concat([d1, d2]).reset_index()
df2 = df2.pivot(index='level_0', columns=[0, 1], values=2)
df2.columns = df2.columns.map('-'.join)
df2 = df2.reset_index(drop=True)

ALTERNATIVE:

df2 = (
    (
        df['long string'].str.extractall(
            r'.*?([a-z]+)\s*?:\s*?\(([a-z]+):\s*(\d+)\s*([a-z]+):\s*(\d+)\)')
        .droplevel("match")
        .set_index(0, append=True)
        .apply(lambda x: x.values.reshape(-1, 2), axis=1)
        .explode()
        .apply(pd.Series)
        .add_prefix('val')
        .reset_index()
    ).pivot(index=['level_0'], columns=[0, 'val0'], values='val1')
).reset_index(drop=True)
df2.columns = df2.columns.map('-'.join)

OUTPUT:

  ha_la ha_tra hi_la hi_tra ho_la ho_tra
0     2      1     2      1     2      1
1     2      1     2      1     2      1
2     2      1     2      1     2      1

Upvotes: 2

Mustafa Aydın
Mustafa Aydın

Reputation: 18315

ndf = (df["long string"]
         .str.extractall(r"(ha|hi|ho):\s\((?:tra|la):\s(\d+)\s(?:tra|la):\s(\d+)\)")
         .droplevel("match")
         .set_index(0, append=True)
         .set_axis(["tra", "la"], axis=1)
         .unstack()
         .swaplevel(axis=1))
ndf.columns = ndf.columns.map("-".join)
  • Extract the desired parts with a regex
  • Drop the index level induced by extractall called match
  • Append the ha-hi-ho matches as the index (0 is first capturing group)
  • Rename the remaining columns tra and la
  • Unstack the ha-hi-ho index to the columns
  • Swap the ha-hi-ho and tra-la levels' order in columns so that ha-hi-ho is upper
  • Lastly join these levels of columns' names with a hyphen

to get

  ha-tra hi-tra ho-tra ha-la hi-la ho-la
0      1      1      1     2     2     2
1      1      1      1     2     2     2
2      1      1      1     2     2     2

Upvotes: 3

Related Questions