Sebastian Zeki
Sebastian Zeki

Reputation: 6874

How to split text in one panda column to create new columns with delimiter as the column name

I have (made up) text in a dataframe as follows:

My Hospital\r\n\r\nPatient Name MS Bilbo Baggins\r\nDate of Birth 12/11/1932\r\nHospital Number:122222D\r\nDate of Procedure 09/11/2018\r\n\r\nEndoscopist:Dr. Jonny BeGood:\r\n2nd Endoscopist:\r\nReferring Physician \r\nGeneral Practicioner Dr Gilbert Prior

I'd like to split the data according to delimiters so that the delimiter becomes the column name. I'd end up with

Raw Text                Patient Name       Date of Birth    Hospital Number   Date of Procedure:    Endoscopist:    Endoscopist:  Referring Physician :   General Practicioner:
                         MS Bilbo Baggins    12/11/1932      122222D           09/11/2018          Dr. Jonny BeGood    NA          NA                     NA

One tricky part is that the delimiters do not always appear in the same order but happy for that to be a second part to the answer if this is not an easy thing to solve.

As a first go I tried the following Mydf.body_analysed.str.extract('(Patient Name.*?Date of Birth)')

but nothing has been extracted and I suspect this is because of carriage returns which further complicates things.

Upvotes: 1

Views: 68

Answers (1)

Hamza usman ghani
Hamza usman ghani

Reputation: 2243

Use series.str.extract()

EXPLANATION

Regex Pattern : f'[\n]*{c}[:]*[\r]*[\n]*([A-Za-z0-9/. ]+)[:]*[\r]*'

  • [\n]*{c}: It will search for every value "c" followed by zero or more occurence of \n
  • [:]*[\r]*[\n]*: zero or more occurrence of : , \r, \n sequentially.
  • ([A-Za-z0-9/. ]+): Capturing desired word that can contain defined set of letters and symbols
  • [:]*[\r]* : zero or more occurrence of :, \r after the captured word.

Complete Code:

Mydf = pd.DataFrame({"body_analysed": ["My Hospital\r\n\r\nPatient Name MS Bilbo Baggins\r\nDate of Birth 12/11/1932\r\nHospital Number:122222D\r\nDate of Procedure 09/11/2018\r\n\r\nEndoscopist:Dr. Jonny BeGood:\r\n2nd Endoscopist:\r\nReferring Physician \r\nGeneral Practicioner Dr Gilbert Prior"]})

columns = ["Patient Name","Date of Birth","Hospital Number","Date of Procedure","Hospital Number","Endoscopist","2nd Endoscopist","General Practicioner"]
for c in columns:
    Mydf[c] = Mydf["body_analysed"].str.extract(f'[\n]*{c}[:]*[\r]*[\n]*([A-Za-z0-9/. ]+)[:]*[\r]*')

OUTPUT:

       body_analysed                                            Patient Name    Date of Birth   Hospital Number Date of Procedure   Endoscopist        2nd Endoscopist    General Practicioner
    0   My Hospital\r\n\r\nPatient Name MS Bilbo Baggi...   MS Bilbo Baggins    12/11/1932      122222D           09/11/2018        Dr. Jonny BeGood   Referring Physician  Dr Gilbert Prior

Upvotes: 2

Related Questions