Reputation: 29064
I would like to separate dots and text from numbers in a dataframe.
The dataframe looks like this:
Net.Liq.37584957
Haircut48216354
Deficit10631397
NaN
Haircutperassetclass
Equity31349682
Commodity12461964
FixedIncome663451
Currency3741257
Tried this: df.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True), but the first row has both dots and text, hence it comes out in this way.
0 1
0 Net .
1 Haircut 48216354
2 Deficit 10631397
3 NaN NaN
4 NaN NaN
5 Equity 31349682
6 Commodity 12461964
7 FixedIncome 663451
8 Currency 3741257
How do i solve this?
Upvotes: 1
Views: 106
Reputation: 30971
Assuming that the column of interest in the source DataFrame has name Txt, run:
df.Txt.str.extract(r'(?P<Letters>[a-z.]*)(?P<Digits>\d*)', flags=re.I)
(import re
required).
The result for your data sample is:
Letters Digits
0 Net.Liq. 37584957
1 Haircut 48216354
2 Deficit 10631397
3 NaN NaN
4 Haircutperassetclass
5 Equity 31349682
6 Commodity 12461964
7 FixedIncome 663451
8 Currency 3741257
Note: The first column has name Letters, but you wrote that you want to separate:
so this column contains actually letters and dots.
Upvotes: 1
Reputation: 626845
You may use
^(.*?)(?:\.?(\d+))?$
See the regex demo
Details
^
- start of string(.*?)
- Group 1: any 0+ chars, as few as possible(?:\.?(\d+))?
- an optional sequence of:
\.?
- an optional dot(\d+)
- Group 2: one or more digits$
- end of string.In code,
df[['A', 'B']] = df['Col'].str.extract(r'(.*?)(?:\.?(\d+))?$', expand=True)
Output:
>>> df
Col A B
0 Net.Liq.37584957 Net.Liq 37584957
1 Haircut48216354 Haircut 48216354
2 Deficit10631397 Deficit 10631397
3 NaN NaN NaN
4 Haircutperassetclass Haircutperassetclass NaN
5 Equity31349682 Equity 31349682
6 Commodity12461964 Commodity 12461964
7 FixedIncome663451 FixedIncome 663451
8 Currency3741257 Currency 3741257
Upvotes: 1
Reputation: 82765
Looks like you need pattern ([a-zA-Z.]+)(\d+)?
Ex:
df = pd.DataFrame({"Col": ['Net.Liq.37584957', 'Haircut48216354', 'Deficit10631397', 'NaN', 'Haircutperassetclass', 'Equity31349682', 'Commodity12461964', 'FixedIncome663451', 'Currency3741257']})
df[['A', "B"]] = df['Col'].str.extract(r"([a-zA-Z.]+)(\d+)?", expand=True)
print(df)
Output:
Col A B
0 Net.Liq.37584957 Net.Liq. 37584957
1 Haircut48216354 Haircut 48216354
2 Deficit10631397 Deficit 10631397
3 NaN NaN NaN
4 Haircutperassetclass Haircutperassetclass NaN
5 Equity31349682 Equity 31349682
6 Commodity12461964 Commodity 12461964
7 FixedIncome663451 FixedIncome 663451
8 Currency3741257 Currency 3741257
Upvotes: 1