Shanoo
Shanoo

Reputation: 1255

Extract numeric values from a variable from a pandas dataframe

I am trying to extract only numeric values from all the columns in a list, whether it is on the right, left or middle of any characters.

I have a dataframe that looks like the below:

df = pd.DataFrame({
    'A': ['1', 3, "1", "cad -2", 3, 4.876, np.nan], 
    'B': ['116', 'CAD -2.6399', 'CAD -3', '4 $', '$5%', 'A', '-1.2 2']
})
df

I tried the below code but it is removing - from column "A" row 4 and column "B" row 3

l = ["A", "B"]
for columns in l:
    if isinstance(df[columns], object):
        df[columns] = df[columns].astype('str').str.extract("([-+]?\d*\.\d+|\d+)").astype(float)
df

I want my final dataframe to look like below:

A      B
1     116 
3     -2.6399 
1     -3 
-2    4 
3     5 
4.876 NaN
NaN   -1.2

Upvotes: 1

Views: 4566

Answers (2)

Anton vBR
Anton vBR

Reputation: 18916

Try to use str.extract() and a regex to find float, something like this:

df['B'] = df['B'].str.extract("([-+]?\d*\.\d+|[-+]?\d+)").astype(float)

Note: 0 matches will return np.nan and 2+ matches will return the first match!

Full example:

import pandas as pd
df = pd.DataFrame({
    'A': [1, 3, 1, 2, 3, 4, 2], 
    'B': ['116', 'CAD 2.6399', 'CAD 3', '4 $', '$5%', 'A', '1.2 2']
})
df['B'] = df['B'].str.extract("([-+]?\d*\.\d+|[-+]?\d+)").astype(float)
print(df)

Returns:

   A         B
0  1  116.0000
1  3    2.6399
2  1    3.0000
3  2    4.0000
4  3    5.0000
5  4       NaN
6  2    1.2000

UPDATE: you can use this for multiple object columns:

for column in [i for i in df.columns if df[i].dtype == 'object']:
    df[column] = df[column].astype(str).str.extract("([-+]?\d*\.\d+|[-+]?\d+)").astype(float)

Based on regex found in this answer: How to extract a floating number from a string

Upvotes: 2

Filip
Filip

Reputation: 1

Try this:

def get_first_nbr_from_str(input_str):
    '''
    :param input_str: strings that contains digit and words
    :return: the number extracted from the input_str
    demo:
    'ab324.23.123xyz': 324.23
    '.5abc44': 0.5
    '''
    if not input_str and not isinstance(input_str, str):
        return 0
    out_number = ''
    for ele in input_str:
        if (ele == '.' and '.' not in out_number) or ele.isdigit():
            out_number += ele
        elif out_number:
            break
    return float(out_number)

And then: df['B'].apply(get_first_nbr_from_str)

Upvotes: 0

Related Questions