Jhilbran
Jhilbran

Reputation: 31

Split a column in Python pandas

I'm sorry if I can't explain properly the issue I'm facing since I don't really understand it that much. I'm starting to learn Python and to practice I try to do projects that I face in my day to day job, but using Python. Right now I'm stuck with a project and would like some help or guidance, I have a dataframe that looks like this

Index  Country   Name    IDs
0      USA       John    PERSID|12345
                         SSO|John123
                         STARTDATE|20210101
                         WAVE|WAVE39
--------------------------------------------
1      UK        Jane    PERSID|25478
                         SSO|Jane123
                         STARTDATE|20210101
                         WAVE|WAVE40

(I apologize since I can't create a table on this post since the separator of the ids is a | ) but you get the idea, every person has 4 IDs and they are all on the same "cell" of the dataframe, each ID separated from its value by pipes, I need to split those ID's from their values, and put them on separate columns so I get something like this

index Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40

Now, adding to the complexity of the table itself, I have another issues, for example, the order of the ID's won't be the same for everyone and some of them will be missing some of the ID's. I honestly have no idea where to begin, the first thing I thought about trying was to split the IDs column by spaces and then split the result of that by pipes, to create a dictionary, convert it to a dataframe and then join it to my original dataframe using the index.

But as I said, my knowledge in python is quite pathetic, so that failed catastrophically, I only got to the first step of that plan with a Client_ids = df.IDs.str.split(), that returns a series with the IDs separated one from each other like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'] but I can't find a way to split it again because I keep getting an error saying the the list object doesn't have attribute 'split'

How should I approach this? what alternatives do I have to do it?

Thank you in advance for any help or recommendation

Upvotes: 2

Views: 299

Answers (4)

piRSquared
piRSquared

Reputation: 294218

Use regex named capture groups with pd.String.str.extract

def ng(x):
    return f'(?:{x}\|(?P<{x}>[^\n]+))?\n?'

fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']

pat = ''.join(map(ng, fields))

df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))

  Country  Name PERSID      SSO STARTDATE    WAVE
0     USA  John  12345  John123  20210101  WAVE39
1      UK  Jane  25478  Jane123  20210101  WAVE40
2      CA  Jill  12345      NaN  20210201  WAVE41

Setup

Credit to @JoeFerndz for sample df.
NOTE: this sample has missing values in some 'IDs'.

df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],

["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],

["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])

Upvotes: 0

Joe Ferndz
Joe Ferndz

Reputation: 8508

You have a few options to consider to do this. Here's how I would do it.

I will split the values in IDs by \n and |. Then create a dictionary with key:value for each split of values of |. Then join it back to the dataframe and drop the IDs and temp columns.

import pandas as pd

df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],

["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],

["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])

    
df['temp'] = df['IDs'].str.split('\n|\|').apply(lambda x: {k:v for k,v in zip(x[::2],x[1::2])})

df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs','temp'],axis=1)

print (df)

With this approach, it does not matter if a row of data is missing. It will sort itself out.

The output of this will be:

Original DataFrame:

   Country       Name    IDs
0      USA       John    PERSID|12345
                         SSO|John123
                         STARTDATE|20210101
                         WAVE|WAVE39
1      UK        Jane    PERSID|25478
                         SSO|Jane123
                         STARTDATE|20210101
                         WAVE|WAVE40
2      CA        Jill    PERSID|12345
                         STARTDATE|20210201
                         WAVE|WAVE41

Updated DataFrame:

  Country  Name PERSID      SSO STARTDATE    WAVE
0     USA  John  12345  John123  20210101  WAVE39
1      UK  Jane  25478  Jane123  20210101  WAVE40
2      CA  Jill  12345      NaN  20210201  WAVE41

Note that Jill did not have a SSO value. It set the value to NaN by default.

Upvotes: 2

Yefet
Yefet

Reputation: 2086

Quick solution

remove_word = ["PERSID", "SSO" ,"STARTDATE" ,"WAVE"]
for i ,col in enumerate(remove_word):
  df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]

Upvotes: 0

jkhoo
jkhoo

Reputation: 31

First generate your dataframe

df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])

Then split the last cell using lambda

df2 = pd.DataFrame(list(df.apply(lambda r: {p:q for p,q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))

Lastly concat the dataframes together.

df = pd.concat([df1, df2], axis=1)

Upvotes: 0

Related Questions