Reputation: 7743
I have a dataframe like as shown below
df = pd.DataFrame({'val': ['>1234','<>','<1000','<test','31sadj',123,43.21]})
I would like to create 3 new columns
val_num
- will store ONLY NUMBER values that comes along with symbols ex: 1234 (from >1234) and 1000 (from <1000) but WILL NOT STORE 31 (from 31sadj) because it doesn't have any symbol
val_str
- will store only values a mix of NUMBER
,symbols
,ALPHABETS
or just plain alphabets ex: 31sadj. It can have any symbols except >
,<
,=
val_symbol
- will store ONLY 3 symbols like >
, <
, =
I tried the below but it isn't accurate
df['val_SYMBOL'] = df['val'].str.extract(r'([<>=]+)').fillna('=')
df['val_num'] = df['val'].str.extract(r'([0-9]+)')
df['val_str'] = df['val'].str.extract(r'([a-zA-Z0-9\s-]+)')
I expect my output to be like as shown below
Upvotes: 3
Views: 120
Reputation: 71687
Series.str.extract
We can use extract
with a regex pattern containing three capturing groups.
df['val'].astype(str).str.extract(
r'([<>=]+)?((?<=[<>=])\d+\.?\d*|\d+\.?\d*(?=$))?(.+)?').fillna({0: '='})
0 1 2
0 > 1234 NaN
1 <> NaN NaN
2 < 1000 NaN
3 < NaN test
4 = NaN 31sadj
5 = 123 NaN
6 = 43.21 NaN
Regex details
([<>=]+)?
: First capturing group matches zero or one time
[<>=]+
: matches one or more characters in the list [<>=]
((?<=[<>=])\d+\.?\d*|\d+\.?\d*(?=$))?
: : Second capturing group matches zero or one time
(?<=[<>=])\d+\.?\d*
: First alternative
(?<=[<>=])\d+\.?\d*
: Matches numbers that come after the symbol present in the list [<>=]
\d+\.?\d*(?=$)
: Second alternative matches numbers at the end of line(.+)?
: Third capturing group matches matches zero or one time
.+
: matches any character one or more times.Upvotes: 2
Reputation: 627128
You can use
df['val_SYMBOL'] = df['val'].astype(str).str.extract(r'([<>=]+)').fillna('=')
df['val_num'] = df['val'].astype(str).str.extract(r'\b(\d+(?:\.\d+)?)\b')
df['val_str'] = df['val'].astype(str).str.extract(r'([^<>=]*[a-zA-Z][^<>=]*)')
You want to work on a mixed data type column, so the first operation is to convert the data to string with astype(str)
.
The val_num
column is populated with \b(\d+(?:\.\d+)?)\b
matches, integer or float numbers matched as whole words (\b
stands for a word boundary).
The val_str
column is populated with ([^<>=]*[a-zA-Z][^<>=]*)
matches, that searches for zero or more chars other than <
, >
and =
, then a letter and then again zero or more chars other than <
, >
and =
.
The output I get:
>>> df
val val_SYMBOL val_num val_str
0 >1234 > 1234 NaN
1 <> <> NaN NaN
2 <1000 < 1000 NaN
3 <test < NaN test
4 31sadj = NaN 31sadj
5 123 = 123 NaN
6 43.21 = 43.21 NaN
Upvotes: 2