Reputation: 6874
I have (made up) text in a dataframe as follows:
My Hospital\r\n\r\nPatient Name MS Bilbo Baggins\r\nDate of Birth 12/11/1932\r\nHospital Number:122222D\r\nDate of Procedure 09/11/2018\r\n\r\nEndoscopist:Dr. Jonny BeGood:\r\n2nd Endoscopist:\r\nReferring Physician \r\nGeneral Practicioner Dr Gilbert Prior
I'd like to split the data according to delimiters so that the delimiter becomes the column name. I'd end up with
Raw Text Patient Name Date of Birth Hospital Number Date of Procedure: Endoscopist: Endoscopist: Referring Physician : General Practicioner:
MS Bilbo Baggins 12/11/1932 122222D 09/11/2018 Dr. Jonny BeGood NA NA NA
One tricky part is that the delimiters do not always appear in the same order but happy for that to be a second part to the answer if this is not an easy thing to solve.
As a first go I tried the following Mydf.body_analysed.str.extract('(Patient Name.*?Date of Birth)')
but nothing has been extracted and I suspect this is because of carriage returns which further complicates things.
Upvotes: 1
Views: 68
Reputation: 2243
Use series.str.extract()
EXPLANATION
Regex Pattern : f'[\n]*{c}[:]*[\r]*[\n]*([A-Za-z0-9/. ]+)[:]*[\r]*'
[\n]*{c}
: It will search for every value "c" followed by zero or more occurence of \n[:]*[\r]*[\n]*
: zero or more occurrence of : , \r, \n sequentially.([A-Za-z0-9/. ]+)
: Capturing desired word that can contain defined set of letters and symbols[:]*[\r]*
: zero or more occurrence of :, \r after the captured word.Complete Code:
Mydf = pd.DataFrame({"body_analysed": ["My Hospital\r\n\r\nPatient Name MS Bilbo Baggins\r\nDate of Birth 12/11/1932\r\nHospital Number:122222D\r\nDate of Procedure 09/11/2018\r\n\r\nEndoscopist:Dr. Jonny BeGood:\r\n2nd Endoscopist:\r\nReferring Physician \r\nGeneral Practicioner Dr Gilbert Prior"]})
columns = ["Patient Name","Date of Birth","Hospital Number","Date of Procedure","Hospital Number","Endoscopist","2nd Endoscopist","General Practicioner"]
for c in columns:
Mydf[c] = Mydf["body_analysed"].str.extract(f'[\n]*{c}[:]*[\r]*[\n]*([A-Za-z0-9/. ]+)[:]*[\r]*')
OUTPUT:
body_analysed Patient Name Date of Birth Hospital Number Date of Procedure Endoscopist 2nd Endoscopist General Practicioner
0 My Hospital\r\n\r\nPatient Name MS Bilbo Baggi... MS Bilbo Baggins 12/11/1932 122222D 09/11/2018 Dr. Jonny BeGood Referring Physician Dr Gilbert Prior
Upvotes: 2