splitting column and extracting country, cities and organisation names

Question

I have a dataframe with address column as below. I would like to split this column such that the countries, cities and institutions are separated into different columns. The challenging part is that each of the cell has a different structure. The common thing in all these cells is that they end in city, country but in some situations such as row index 3, there are multiple entries.

    id      address
------------------------------------------------------------------------------------------------
0   223     Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
1   223     Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
2   223     Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
3   223     Department of Surgery, Närebro University Hospital, Närebro; Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden'}, {'id': '9900', 'name': 'Närebro universitet, Institutionen för läkarutbildning
4   223     Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
5   223     Division of Gastrointestinal Surgery, Nottingham Digestive Diseases Centre, National Institute for Health Research, Biomedical Research Unit, Nottingham University Hospitals, Queen's Medical Centre, Nottingham, UK
6   223     Hospital of Lausanne (CHUV), Lausanne, Switzerland
7   223     Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
8   223     Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
9   223     Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway

can anyone help here please?

Note the above dataframe is a subset of my dataframe and that's why the id column has same values for all the rows. The original dataframe has about 10k rows and that's why could not share it here.

piterbarg · Accepted Answer

This is probably too simplistic for your 10K rows database but hopefully sets you in the right direction.

Note that row index 3 is ill-formed as it has curly brackets etc -- looks like the parsing issues when creating/scraping your data. In the below this is ignored, in practice you want to sanitize your inputs or fix the issue upstream.

First I create a toy dataset from your data:

import pandas as pd
from io import StringIO
raw_data = StringIO(
"""
!Id!address
0!223!Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
1!223!Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
2!223!Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
3!223!Department of Surgery, Närebro University Hospital, Närebro; Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden'}, {'id': '9900', 'name': 'Närebro universitet, Institutionen för läkarutbildning
4!223!Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
5!223!Division of Gastrointestinal Surgery, Nottingham Digestive Diseases Centre, National Institute for Health Research, Biomedical Research Unit, Nottingham University Hospitals, Queen's Medical Centre, Nottingham, UK
6!223!Hospital of Lausanne (CHUV), Lausanne, Switzerland
7!223!Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
8!223!Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
9!223!Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
""")
data = pd.read_csv(raw_data, index_col=0, delimiter='!')

Next, for rows that have multiple addresses, I split and put them on separate rows in the dataframe. I assume they are separated by ';' always, as is the case in your example

data['address'] = data['address'].str.split(';')
data = data.explode('address')

Next I tokenize the address by splitting it by ','. Here address_tokens column will contain lists of tokens after this

data['address_tokens'] = data['address'].str.split(',')

Now for each row we combine the tokens into a list of three elements containing [tokens[0:N-3] joined together by a comma, token[N-2], token[N-1]] and we identify these as institution, city, country

data['address_3'] = data['address_tokens'].apply(lambda tks: [','.join(tks[:-3]), tks[-2], tks[-1]] )
data[['institution', 'city', 'country']] = data['address_3'].apply(pd.Series)

I kept all the intermediate steps in the dataframe so you can see the results. The three columns ['institution', 'city', 'country'] contain what you asked for, except for some trouble with {,} etc coming from the original row index 3

splitting column and extracting country, cities and organisation names

Answers (1)

Related Questions