Reputation: 365
I have a dataset where I am trying to extract the simple town name from the longer messy version shown here. Most of them are followed by parentheses " (.*", but some do not follow this pattern and end in ":" (see line 200). Finally, there are some that do not have parentheses but split parts with a comma "," (see line 240, 246).
'Region'
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater (Bridgewater State College)[2]
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill (Boston College)
200 The Colleges of Worcester Consortium:
201 Dudley (Nichols College)
240 Faribault, South Central College
241 Mankato (Minnesota State University, Mankato),...
242 Marshall (Southwest Minnesota State University...
243 Moorhead (Minnesota State University, Moorhead...
244 Morris (University of Minnesota Morris)[2]
245 Northfield (Carleton College, St. Olaf College...
246 North Mankato, South Central College
247 St. Cloud (St. Cloud State University, The Col...
248 St. Joseph (College of Saint Benedict)[2]
249 St. Peter (Gustavus Adolphus College)[2]
What I would ideally like to see is:
'RegionName'
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
My code currently is:
df['RegionName'] = df['Region'].str.extract('(.*)[:(,]', expand=False)
But this gives me the weird result of not getting the parentheses right:
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato (Minnesota State University, Mankato)
242 Marshall
243 Moorhead (Minnesota State University, Moorhead
244 Morris
245 Northfield (Carleton College
246 North Mankato
247 St. Cloud (St. Cloud State University
248 St. Joseph
249 St. Peter
I have also tried:
df['RegionName'] = df['Region'].str.extract('(.*)[ (.*|:|,]', expand=False)
I am not sure exactly how to extract the string using all three patterns at the same time. Would be open to a two line solution as well. Thanks (apologies if this is formatted poorly!)
Upvotes: 0
Views: 134
Reputation: 627419
You may just extract any 0 or more chars other than :
, ,
or (
at the beginning of a string with
df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
If you are working with Python 2.x, use (?u)
at the beginning of the pattern so that the word boundary \b
could also match the right places in a Unicode string.
Details
^
- start of a string([^:(,]*)
- Group 1: zero or more (*
) consecutive occurrences of any char other than (the [^...]
forms a negated character class) :
, (
and ,
.\b
- a word boundary.See the regex demo and a Python 3 demo below:
>>> from pandas import DataFrame
>>> import pandas as pd
>>> item_list = ['Boston (Boston University, Boston College, Bos...','Bridgewater (Bridgewater State College)[2]','Cambridge (Harvard University, Massachusetts I...','Chestnut Hill (Boston College)','The Colleges of Worcester Consortium:','Dudley (Nichols College)','Faribault, South Central College','Mankato (Minnesota State University, Mankato),...','Marshall (Southwest Minnesota State University...','Moorhead (Minnesota State University, Moorhead...','Morris (University of Minnesota Morris)[2]','Northfield (Carleton College, St. Olaf College...','North Mankato, South Central College','St. Cloud (St. Cloud State University, The Col...','St. Joseph (College of Saint Benedict)[2]','St. Peter (Gustavus Adolphus College)[2]']
>>> df = pd.DataFrame(item_list, columns=['Region'])
>>> df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
>>> df['RegionName']
RegionName
0 Boston
1 Bridgewater
2 Cambridge
3 Chestnut Hill
4 The Colleges of Worcester Consortium
5 Dudley
6 Faribault
7 Mankato
8 Marshall
9 Moorhead
10 Morris
11 Northfield
12 North Mankato
13 St. Cloud
14 St. Joseph
15 St. Peter
>>>
Upvotes: 1
Reputation: 174708
Since you only have three possible delimiters, you can take advantage of chained split(), since split will return the unmodified string if the delimiter is not found.
>>> s = """196 Boston (Boston University, Boston College, Bos...
... 197 Bridgewater (Bridgewater State College)[2]
... 198 Cambridge (Harvard University, Massachusetts I...
... 199 Chestnut Hill (Boston College)
... 200 The Colleges of Worcester Consortium:
... 201 Dudley (Nichols College)
... 240 Faribault, South Central College
... 241 Mankato (Minnesota State University, Mankato),...
... 242 Marshall (Southwest Minnesota State University...
... 243 Moorhead (Minnesota State University, Moorhead...
... 244 Morris (University of Minnesota Morris)[2]
... 245 Northfield (Carleton College, St. Olaf College...
... 246 North Mankato, South Central College
... 247 St. Cloud (St. Cloud State University, The Col...
... 248 St. Joseph (College of Saint Benedict)[2]
... 249 St. Peter (Gustavus Adolphus College)[2]"""
>>> for i in s.split('\n'):
... number, text = i.split('(')[0].split(',')[0].split(':')[0].split(' ',1)
... print('{} {}'.format(number, text.strip()))
...
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
You can use df.apply
to do the same transformation for your strings.
Upvotes: 1
Reputation: 1765
Use this regular expression:
([\w\s.]+)(?<!\s)
You can remove the negative look-behind (?<!\s)
at the end if you don't care about trailing white spaces.
Upvotes: 0