The Great
The Great

Reputation: 7743

Segregate a column data based on regex using pandas

I have a dataframe like as shown below

df = pd.DataFrame({'val': ['>1234','<>','<1000','<test','31sadj',123,43.21]})

I would like to create 3 new columns

val_num - will store ONLY NUMBER values that comes along with symbols ex: 1234 (from >1234) and 1000 (from <1000) but WILL NOT STORE 31 (from 31sadj) because it doesn't have any symbol

val_str - will store only values a mix of NUMBER,symbols,ALPHABETS or just plain alphabets ex: 31sadj. It can have any symbols except >,<,=

val_symbol - will store ONLY 3 symbols like >, <, =

I tried the below but it isn't accurate

df['val_SYMBOL'] = df['val'].str.extract(r'([<>=]+)').fillna('=')
df['val_num'] = df['val'].str.extract(r'([0-9]+)')
df['val_str'] = df['val'].str.extract(r'([a-zA-Z0-9\s-]+)') 

I expect my output to be like as shown below

enter image description here

Upvotes: 3

Views: 120

Answers (2)

Shubham Sharma
Shubham Sharma

Reputation: 71687

Series.str.extract

We can use extract with a regex pattern containing three capturing groups.

df['val'].astype(str).str.extract(
    r'([<>=]+)?((?<=[<>=])\d+\.?\d*|\d+\.?\d*(?=$))?(.+)?').fillna({0: '='})

    0      1       2
0   >   1234     NaN
1  <>    NaN     NaN
2   <   1000     NaN
3   <    NaN    test
4   =    NaN  31sadj
5   =    123     NaN
6   =  43.21     NaN

Regex details

  • ([<>=]+)? : First capturing group matches zero or one time

    • [<>=]+ : matches one or more characters in the list [<>=]
  • ((?<=[<>=])\d+\.?\d*|\d+\.?\d*(?=$))? : : Second capturing group matches zero or one time

    • (?<=[<>=])\d+\.?\d* : First alternative
      • (?<=[<>=])\d+\.?\d* : Matches numbers that come after the symbol present in the list [<>=]
    • \d+\.?\d*(?=$) : Second alternative matches numbers at the end of line
  • (.+)? : Third capturing group matches matches zero or one time

    • .+ : matches any character one or more times.

See the online regex demo

Upvotes: 2

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627128

You can use

df['val_SYMBOL'] = df['val'].astype(str).str.extract(r'([<>=]+)').fillna('=')
df['val_num'] = df['val'].astype(str).str.extract(r'\b(\d+(?:\.\d+)?)\b')
df['val_str'] = df['val'].astype(str).str.extract(r'([^<>=]*[a-zA-Z][^<>=]*)')

You want to work on a mixed data type column, so the first operation is to convert the data to string with astype(str).

The val_num column is populated with \b(\d+(?:\.\d+)?)\b matches, integer or float numbers matched as whole words (\b stands for a word boundary).

The val_str column is populated with ([^<>=]*[a-zA-Z][^<>=]*) matches, that searches for zero or more chars other than <, > and =, then a letter and then again zero or more chars other than <, > and =.

The output I get:

>>> df
      val val_SYMBOL val_num val_str
0   >1234          >    1234     NaN
1      <>         <>     NaN     NaN
2   <1000          <    1000     NaN
3   <test          <     NaN    test
4  31sadj          =     NaN  31sadj
5     123          =     123     NaN
6   43.21          =   43.21     NaN

Upvotes: 2

Related Questions