Induraj PR
Induraj PR

Reputation: 304

Regex: Exclude alphabets and extract numbers alone

I have a dataframe column that contains data similar to this,

1015 - 1540
56.56 - 156
34.46Sq. Meter 
36Sq. Meter 
1000Sq. Meter

I need to remove the alphabets and make the data in the column look like below.

1015 - 1540
56.56 - 156
34.46 
36
1000

I tried the below regex pattern, however, it seems to not work properly. Thanks in advance for helping.

(\d*\s\-\s\d*)|(\d*[\.]?\d*)(?=\w)

Upvotes: 0

Views: 148

Answers (2)

Bohemian
Bohemian

Reputation: 425208

Delete (by replacing with a blank) all unwanted chars, which match this regex:

[^\d .-]|(?<! )-(?! )|(?<=\D)[.]|(?<!-) (?!-)

See live demo.

Use re.sub(thisRegex, '', yourStr), ie:

cleaned = re.sub('[^\d .-]|(?<! )-(?! )|(?<=\D)[.]|(?<!-) (?!-)', '', str)

The regex is an OR or these:

  • [^\d .-] any char not a digit, space, dot or dash
  • (?<! )-(?! ) dash not next to a space
  • (?<=\D)[.] dot preceded by a non digit
  • (?<!-) (?!-) space not next to a dash

If this gets heavy use, consider compiling the regex once and reusing it.

Upvotes: 0

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627103

You can use

df['result'] = df['col'].str.extract(r'(\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)', expand=False)

See the regex demo. The regex contains a single capturing group (Series.str.extract requires at least one capturing group to return a value) that matches

  • \d+(?:\.\d+)? - one or more digits and then an optional sequence of a dot and one or more digits
  • (?: - start of a non-capturing group:
    • \s*-\s* - a hyphen enclosed with zero or more whitespaces
    • \d+(?:\.\d+)? - one or more digits and then an optional sequence of a dot and one or more digits
  • )? - end of the group, match 1 or 0 times (i.e. it is optional).

Upvotes: 1

Related Questions