Reputation: 149
I have a pandas data frame data
with several columns. One of these columns is GEN
. This column contains german cities as strings. Some of these cities are in a bad format, meaning that they have values like "Frankfurt a.Main"
. For every element in data['GEN']
I would like to replace every expression of the form "\.[A-ZÄÖÜ]"
(i.e. dot followed by upper case letter) by the corresponding expression "\.\b[A-ZÄÖÜ]"
. For example
"Frankfurt a.Main"
becomes "Frankfurt a. Main"
"Frankfurt a.d.Oder"
becomes "Frankfurt a.d. Oder"
and so on.I am pretty sure that pandas.Series.str.contains
and pandas.Series.str.replace
are helpful here, but one of my problems is that I don't know how to put the replacement task in a form that can be used by the above functions.
Upvotes: 0
Views: 118
Reputation: 163632
You could assert a dot to the left using a positive lookbehind (?<=\.)
and match one of [A-ZÄÖÜ]
In the replacement use a space followed by the full match using \g<0>
import pandas as pd
pattern = r"(?<=\.)[A-ZÄÖÜ]"
items = [
"Frankfurt a.Main",
"Frankfurt a.d.Oder"
]
data = pd.DataFrame(items, columns=["GEN"])
data['GEN'] = data['GEN'].str.replace(pattern, r' \g<0>')
print(data)
Output
GEN
0 Frankfurt a. Main
1 Frankfurt a.d. Oder
Upvotes: 1
Reputation: 37902
You can use pandas.Series.str.replace
to capture the two groups that compose a german city name in your original data and then add a whitespace between them.
Try this :
data['GEN'] = data['GEN'].str.replace(r'(\w+\s.*\.)(\w*)', r'\1 \2', regex=True)
0 Frankfurt a. Main
1 Frankfurt a.d. Oder
Upvotes: 1