Hemant
Hemant

Reputation: 59

How to split string with the values in their specific columns indexed on their label?

I have the following data

Index   Data
0       100CO
1       50CO-50PET
2       98CV-2EL
3       50CV-50CO
.       .
.       .
.       .

I have to create split the data format into different columns each with their own header and their values, the result should be as below:

Index   Data          CO        PET      CV    EL
0       100CO         100       0        0     0
1       50CO-50PET    50        50       0     0
2       98CV-2EL      0         0        98    2
3       50CV-50CO     50        0        50    0
.       .
.       .
.       .

The data is not limited to CO/PET/CV/EL, will need as many columns needed each displaying its corresponding value.

The .str.split('-', expand=True) function will only delimit the data and keep all first values in same column and does not rename each column.

Is there a way to implement this in python?

Upvotes: 2

Views: 340

Answers (3)

Rajesh Bhat
Rajesh Bhat

Reputation: 1000

Try this:

import pandas as pd
import re

df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})

split_df = pd.DataFrame(df.Data.apply(lambda x: {re.findall('[A-Z]+', el)[0] : re.findall('[0-9]+', el)[0] \
                                  for el in x.split('-')}).tolist())
split_df = split_df.fillna(0)

df = pd.concat([df, split_df], axis = 1)

Upvotes: 0

Onyambu
Onyambu

Reputation: 79238

You could do:

df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)',expand = True). \
   reset_index().pivot('index',1,0).fillna(0).reset_index()

1  Index   CO  CV EL PET
0      0  100   0  0   0
1      1   50   0  0  50
2      2    0  98  2   0
3      3   50  50  0   0

Upvotes: 2

jezrael
jezrael

Reputation: 862791

Idea is first split values by -, then extract numbers and no numbers values to tuples, append to list and convert to dictionaries. It is passed in list comprehension to DataFrame cosntructor, replaced misisng values and converted to numeric:

import re

def f(x):
    L = []
    for val in x.split('-'):
        k, v = re.findall('(\d+)(\D+)', val)[0]
        L.append((v, k))
    return dict(L)

df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
         Data   CO  PET  CV  EL
0       100CO  100    0   0   0
1  50CO-50PET   50   50   0   0
2    98CV-2EL    0    0  98   2
3   50CV-50CO   50    0  50   0

If in data exist some values without number or number only solution should be changed for more general like:

print (df)
         Data
0       100CO
1  50CO-50PET
2    98CV-2EL
3   50CV-50CO
4         AAA
5          20

def f(x):
    L = []
    for val in x.split('-'):
        extracted = re.findall('(\d+)(\D+)', val)
        if len(extracted) > 0:
            k, v = extracted[0]
            L.append((v, k))
        else:
            if val.isdigit():
                L.append(('No match digit', val)) 
            else:
                L.append((val, 0))
    return dict(L)
    

df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
         Data   CO  PET  CV  EL  AAA  No match digit
0       100CO  100    0   0   0    0               0
1  50CO-50PET   50   50   0   0    0               0
2    98CV-2EL    0    0  98   2    0               0
3   50CV-50CO   50    0  50   0    0               0
4         AAA    0    0   0   0    0               0
5          20    0    0   0   0    0              20

Upvotes: 1

Related Questions