How to split text in one panda column to create new columns with delimiter as the column name

Question

I have (made up) text in a dataframe as follows:

My Hospital

Patient Name MS Bilbo Baggins
Date of Birth 12/11/1932
Hospital Number:122222D
Date of Procedure 09/11/2018

Endoscopist:Dr. Jonny BeGood:
2nd Endoscopist:
Referring Physician 
General Practicioner Dr Gilbert Prior

I'd like to split the data according to delimiters so that the delimiter becomes the column name. I'd end up with

Raw Text                Patient Name       Date of Birth    Hospital Number   Date of Procedure:    Endoscopist:    Endoscopist:  Referring Physician :   General Practicioner:
                         MS Bilbo Baggins    12/11/1932      122222D           09/11/2018          Dr. Jonny BeGood    NA          NA                     NA

One tricky part is that the delimiters do not always appear in the same order but happy for that to be a second part to the answer if this is not an easy thing to solve.

As a first go I tried the following Mydf.body_analysed.str.extract('(Patient Name.*?Date of Birth)')

but nothing has been extracted and I suspect this is because of carriage returns which further complicates things.

Hamza usman ghani · Accepted Answer

Use series.str.extract()

EXPLANATION

Regex Pattern : f'[ ]*{c}[:]*[ ]*[ ]*([A-Za-z0-9/. ]+)[:]*[ ]*'

[ ]*{c}: It will search for every value "c" followed by zero or more occurence of
[:]*[ ]*[ ]*: zero or more occurrence of : , , sequentially.
([A-Za-z0-9/. ]+): Capturing desired word that can contain defined set of letters and symbols
[:]*[ ]* : zero or more occurrence of :, after the captured word.

Complete Code:

Mydf = pd.DataFrame({"body_analysed": ["My Hospital

Patient Name MS Bilbo Baggins
Date of Birth 12/11/1932
Hospital Number:122222D
Date of Procedure 09/11/2018

Endoscopist:Dr. Jonny BeGood:
2nd Endoscopist:
Referring Physician 
General Practicioner Dr Gilbert Prior"]})

columns = ["Patient Name","Date of Birth","Hospital Number","Date of Procedure","Hospital Number","Endoscopist","2nd Endoscopist","General Practicioner"]
for c in columns:
    Mydf[c] = Mydf["body_analysed"].str.extract(f'[
]*{c}[:]*[
]*[
]*([A-Za-z0-9/. ]+)[:]*[
]*')

OUTPUT:

       body_analysed                                            Patient Name    Date of Birth   Hospital Number Date of Procedure   Endoscopist        2nd Endoscopist    General Practicioner
    0   My Hospital

Patient Name MS Bilbo Baggi...   MS Bilbo Baggins    12/11/1932      122222D           09/11/2018        Dr. Jonny BeGood   Referring Physician  Dr Gilbert Prior

How to split text in one panda column to create new columns with delimiter as the column name

Answers (1)

Related Questions