Reputation: 304
I have a dataframe column that contains data similar to this,
1015 - 1540
56.56 - 156
34.46Sq. Meter
36Sq. Meter
1000Sq. Meter
I need to remove the alphabets and make the data in the column look like below.
1015 - 1540
56.56 - 156
34.46
36
1000
I tried the below regex pattern, however, it seems to not work properly. Thanks in advance for helping.
(\d*\s\-\s\d*)|(\d*[\.]?\d*)(?=\w)
Upvotes: 0
Views: 148
Reputation: 425208
Delete (by replacing with a blank) all unwanted chars, which match this regex:
[^\d .-]|(?<! )-(?! )|(?<=\D)[.]|(?<!-) (?!-)
See live demo.
Use re.sub(thisRegex, '', yourStr)
, ie:
cleaned = re.sub('[^\d .-]|(?<! )-(?! )|(?<=\D)[.]|(?<!-) (?!-)', '', str)
The regex is an OR or these:
[^\d .-]
any char not a digit, space, dot or dash(?<! )-(?! )
dash not next to a space(?<=\D)[.]
dot preceded by a non digit(?<!-) (?!-)
space not next to a dashIf this gets heavy use, consider compiling the regex once and reusing it.
Upvotes: 0
Reputation: 627103
You can use
df['result'] = df['col'].str.extract(r'(\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)', expand=False)
See the regex demo. The regex contains a single capturing group (Series.str.extract
requires at least one capturing group to return a value) that matches
\d+(?:\.\d+)?
- one or more digits and then an optional sequence of a dot and one or more digits(?:
- start of a non-capturing group:
\s*-\s*
- a hyphen enclosed with zero or more whitespaces\d+(?:\.\d+)?
- one or more digits and then an optional sequence of a dot and one or more digits)?
- end of the group, match 1 or 0 times (i.e. it is optional).Upvotes: 1