How to extract sections of a string in a pandas dataframe column

Question

I have a dataframe, df, where I would like specific separations of values within my column to display the first word and the number along with its 'T' value. I would like the first 'word' that is separated by '-', and its #T value. With the exception of 'Azure' case, where the first word is separated by '_'

It is tricky because some of the #T values are separated by '-', while others are separated by '_' ex. -12T in one of the values , as well as _14T in another value I would like to maintain the original values in the type column

Sample Data

data = {'type': ['Azure_Standard_E64is_v4_SPECIAL_DB-A.0', 'Azure_Standard_E64is_v4_SPECIAL_DB-A.0', 'Hello-HEL-HE-A6123-123A-12T_TYPE-v.A', 'Hello-HEL-HE-A6123-123A-12T_TYPE-v.E', 'Hello-HEL-HE-A6123-123A-50T_TYPE-v.C', 'Hello-HEL-HE-A6123-123A-50T_TYPE-v.A', 'Happy-HAP-HA-R650-570A-90T_version-v.A', 'Kind-KIN-KI-T490-NET_14T-A.0', 'Kind-KIN-KI-T490-NET_14T-A.0', 'AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A', 'AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A'], 'free': [6, 5, 10, 5, 1, 2, 10, 7, 6, 3, 0], 'use': [1, 1, 10, 1, 4, 1, 0, 4, 3, 0, 20], 'total': [7, 6, 20, 6, 5, 1, 10, 3, 2, 3, 20]}
df = pd.DataFrame(data)


                                      type  free  use  total
0   Azure_Standard_E64is_v4_SPECIAL_DB-A.0     6    1      7
1   Azure_Standard_E64is_v4_SPECIAL_DB-A.0     5    1      6
2     Hello-HEL-HE-A6123-123A-12T_TYPE-v.A    10   10     20
3     Hello-HEL-HE-A6123-123A-12T_TYPE-v.E     5    1      6
4     Hello-HEL-HE-A6123-123A-50T_TYPE-v.C     1    4      5
5     Hello-HEL-HE-A6123-123A-50T_TYPE-v.A     2    1      1
6   Happy-HAP-HA-R650-570A-90T_version-v.A    10    0     10
7             Kind-KIN-KI-T490-NET_14T-A.0     7    4      3
8             Kind-KIN-KI-T490-NET_14T-A.0     6    3      2
9      AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A     3    0      3
10     AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A     0   20     20

Desired:

   Name                                          type                free   use  total
  
   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure               6       1    7       
   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure               5       1    6                                       
   Hello-HEL-HE-A6123-123A-12T_TYPE-v.A          Hello   12T         10      10  20
   Hello-HEL-HE-A6123-123A-12T_TYPE-v.E          Hello   12T         5       1    6
   Hello-HEL-HE-A6123-123A-50T_TYPE-v.C          Hello   50T         1       4    5
   Hello-HEL-HE-A6123-123A-50T_TYPE-v.A          Hello   50T         2       1    1
   Happy-HAP-HA-R650-570A-90T_version-v.A        Happy   90T         10      0   10
   Kind-KIN-KI-T490-NET_14T-A.0                  Kind    14T         7      4    3
   Kind-KIN-KI-T490-NET_14T-A.0                  Kind    14T         6      3    2
   AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A           AY14.5  6.4T        3      0    3
   AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A           AY14.5  6.4T        0      20   20

Doing:

df['type']= df['type'].str.extract(r'(^\w+.\d|^\w+)')+' '+df['type'].str.extract(r'(\d.\d+T|\d+T)')

This works below, however, the 'AZURE' value disappears, and the original value is not maintained. I am still researching this, any assistance is appreciated.

jezrael · Accepted Answer

You can use Series.str.replace with Series.str.cat and last add Series.str.strip, also is added expand=False to Series.str.extract for Series.

For new column for second position is used DataFrame.insert.

s = (df['type'].str.replace('_','-')
               .str.extract(r'(^\w+.\d|^\w+)', expand=False)
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False), 
                        sep=' ', 
                        na_rep='')
               .str.strip())

Thank you @Trenton McKinney for another solution - splitting values and get first one values of lists:

s = (df['type'].str.split('_|-')
               .str[0]
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False), 
                        sep=' ', 
                        na_rep='')
               .str.strip())

df = df.rename(columns={'type': 'Name'})
df.insert(1, 'type', s)
print (df)
                                      Name         type  free  use  total
0   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure     6    1      7
1   Azure_Standard_E64is_v4_SPECIAL_DB-A.0        Azure     5    1      6
2     Hello-HEL-HE-A6123-123A-12T_TYPE-v.A    Hello 12T    10   10     20
3     Hello-HEL-HE-A6123-123A-12T_TYPE-v.E    Hello 12T     5    1      6
4     Hello-HEL-HE-A6123-123A-50T_TYPE-v.C    Hello 50T     1    4      5
5     Hello-HEL-HE-A6123-123A-50T_TYPE-v.A    Hello 50T     2    1      1
6   Happy-HAP-HA-R650-570A-90T_version-v.A    Happy 90T    10    0     10
7             Kind-KIN-KI-T490-NET_14T-A.0     Kind 14T     7    4      3
8             Kind-KIN-KI-T490-NET_14T-A.0     Kind 14T     6    3      2
9      AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A  AY14.5 6.4T     3    0      3
10     AY14.5-fyy-FY-R770-256G-6.4T-R1-v.A  AY14.5 6.4T     0   20     20

How to extract sections of a string in a pandas dataframe column

Sample Data

Answers (1)

Related Questions