Reputation: 304

Regex: Exclude alphabets and extract numbers alone

I have a dataframe column that contains data similar to this,

1015 - 1540
56.56 - 156
34.46Sq. Meter 
36Sq. Meter 
1000Sq. Meter

I need to remove the alphabets and make the data in the column look like below.

1015 - 1540
56.56 - 156
34.46 
36
1000

I tried the below regex pattern, however, it seems to not work properly. Thanks in advance for helping.

(\d*\s\-\s\d*)|(\d*[\.]?\d*)(?=\w)

Upvotes: 0

Answers (2)

Bohemian

Reputation: 425208

Delete (by replacing with a blank) all unwanted chars, which match this regex:

[^\d .-]|(?<! )-(?! )|(?<=\D)[.]|(?<!-) (?!-)

See live demo.

Use re.sub(thisRegex, '', yourStr), ie:

cleaned = re.sub('[^\d .-]|(?<! )-(?! )|(?<=\D)[.]|(?<!-) (?!-)', '', str)

The regex is an OR or these:

[^\d .-] any char not a digit, space, dot or dash
(?<! )-(?! ) dash not next to a space
(?<=\D)[.] dot preceded by a non digit
(?<!-) (?!-) space not next to a dash

If this gets heavy use, consider compiling the regex once and reusing it.

Upvotes: 0

Wiktor Stribiżew

Reputation: 627103

You can use

df['result'] = df['col'].str.extract(r'(\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)', expand=False)

See the regex demo. The regex contains a single capturing group (Series.str.extract requires at least one capturing group to return a value) that matches

\d+(?:\.\d+)? - one or more digits and then an optional sequence of a dot and one or more digits
(?: - start of a non-capturing group:
- \s*-\s* - a hyphen enclosed with zero or more whitespaces
- \d+(?:\.\d+)? - one or more digits and then an optional sequence of a dot and one or more digits
)? - end of the group, match 1 or 0 times (i.e. it is optional).

Upvotes: 1

Regex: Exclude alphabets and extract numbers alone

Answers (2)

Related Questions