lakshmen
lakshmen

Reputation: 29064

Separate dots and text from numbers in python dataframe

I would like to separate dots and text from numbers in a dataframe.

The dataframe looks like this:

Net.Liq.37584957
Haircut48216354
Deficit10631397
             NaN
Haircutperassetclass
Equity31349682
Commodity12461964
FixedIncome663451
Currency3741257

Tried this: df.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True), but the first row has both dots and text, hence it comes out in this way.

             0         1
0          Net         .
1      Haircut  48216354
2      Deficit  10631397
3          NaN       NaN
4          NaN       NaN
5       Equity  31349682
6    Commodity  12461964
7  FixedIncome    663451
8     Currency   3741257

How do i solve this?

Upvotes: 1

Views: 106

Answers (3)

Valdi_Bo
Valdi_Bo

Reputation: 30971

Assuming that the column of interest in the source DataFrame has name Txt, run:

df.Txt.str.extract(r'(?P<Letters>[a-z.]*)(?P<Digits>\d*)', flags=re.I)

(import re required).

The result for your data sample is:

                Letters    Digits
0              Net.Liq.  37584957
1               Haircut  48216354
2               Deficit  10631397
3                   NaN       NaN
4  Haircutperassetclass          
5                Equity  31349682
6             Commodity  12461964
7           FixedIncome    663451
8              Currency   3741257

Note: The first column has name Letters, but you wrote that you want to separate:

  • dots and text (actually letters),
  • from digits,

so this column contains actually letters and dots.

Upvotes: 1

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626845

You may use

^(.*?)(?:\.?(\d+))?$

See the regex demo

Details

  • ^ - start of string
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • (?:\.?(\d+))? - an optional sequence of:
    • \.? - an optional dot
    • (\d+) - Group 2: one or more digits
  • $ - end of string.

In code,

df[['A', 'B']] = df['Col'].str.extract(r'(.*?)(?:\.?(\d+))?$', expand=True)

Output:

>>> df
                    Col                     A         B
0      Net.Liq.37584957               Net.Liq  37584957
1       Haircut48216354               Haircut  48216354
2       Deficit10631397               Deficit  10631397
3                   NaN                   NaN       NaN
4  Haircutperassetclass  Haircutperassetclass       NaN
5        Equity31349682                Equity  31349682
6     Commodity12461964             Commodity  12461964
7     FixedIncome663451           FixedIncome    663451
8       Currency3741257              Currency   3741257

Upvotes: 1

Rakesh
Rakesh

Reputation: 82765

Looks like you need pattern ([a-zA-Z.]+)(\d+)?

Ex:

df = pd.DataFrame({"Col": ['Net.Liq.37584957', 'Haircut48216354', 'Deficit10631397', 'NaN', 'Haircutperassetclass', 'Equity31349682', 'Commodity12461964', 'FixedIncome663451', 'Currency3741257']})
df[['A', "B"]] = df['Col'].str.extract(r"([a-zA-Z.]+)(\d+)?", expand=True)
print(df)

Output:

                    Col                     A         B
0      Net.Liq.37584957              Net.Liq.  37584957
1       Haircut48216354               Haircut  48216354
2       Deficit10631397               Deficit  10631397
3                   NaN                   NaN       NaN
4  Haircutperassetclass  Haircutperassetclass       NaN
5        Equity31349682                Equity  31349682
6     Commodity12461964             Commodity  12461964
7     FixedIncome663451           FixedIncome    663451
8       Currency3741257              Currency   3741257

Upvotes: 1

Related Questions