mLstudent33
mLstudent33

Reputation: 1175

Pandas how to extract mix of ints and floats in dataframe columns

I tried these: https://stackoverflow.com/a/37683738/13865853, https://stackoverflow.com/a/50830098/13865853.

My dataframe is all strings but the dtype is object for reasons I read elsewhere on SO.

The columns are units of micronutrients in foods that look like this:

  Life-Stage Group Arsenic Boron (mg/d) Calcium (mg/d) Chromium Copper (μg/d)  \
0         <= 3.0 y   nan g         3 mg        2500 mg    nan g       1000 μg   
1         <= 8.0 y   nan g         6 mg        2500 mg    nan g       3000 μg   

  Fluoride (mg/d) Iodine (μg/d) Iron (mg/d) Magnesium (mg/d) Manganese (mg/d)  \
0          1.3 mg        200 μg       40 mg            65 mg             2 mg   
1          2.2 mg        300 μg       40 mg           110 mg             3 mg   

  Molybdenum (μg/d) Nickel (mg/d) Phosphorus (g/d) Potassium Selenium (μg/d)  \
0            300 μg        0.2 mg              3 g     nan g           90 μg   
1            600 μg        0.3 mg              3 g     nan g          150 μg   

  Silicon Sulfate Vanadium (mg/d) Zinc (mg/d) Sodium Chloride (g/d)  \
0   nan g   nan g          nan mg        7 mg  nan g          2.3 g   
1   nan g   nan g          nan mg       12 mg  nan g          2.9 g   

  Vitamin A (μg/d) Vitamin C (mg/d) Vitamin D (μg/d) Vitamin E (mg/d)  \
0         600.0 μg           400 mg          63.0 μg           200 mg   
1         900.0 μg           650 mg          75.0 μg           300 mg   

  Vitamin K (μg/d) Thiamin (mg/d) Riboflavin (mg/d) Niacin (mg/d)  \
0           nan μg         nan mg            nan mg         10 mg   
1           nan μg         nan mg            nan mg         15 mg   

  Vitamin B6 (mg/d) Folate (μg/d) Vitamin B12 (μg/d) Pantothenic Acid (mg/d)  \
0             30 mg        300 μg             nan μg                  nan mg   
1             40 mg        400 μg             nan μg                  nan mg   

  Biotin (μg/d) Choline (mg/d) Carotenoids  
0        nan μg         1.0 mg       nan g  
1        nan μg         1.0 mg       nan g  

I want to zero-out nan and just get the numerical values as I want to multiply g by 1000 and divide any ug (\u03BCg in Python for micro) by 1000 so that everything is in mg so I can plot them on a bar graph in Plotly Dash.
But I'm stuck at extracting numbers. Previously when I was making csv files after downloading the data, this worked but it now does not:

# extract numbers
new_df_arr = []
for _,df in df_dict.items():
    df = df.astype(str)
    df_copy = df.copy()
    for i in range(1, len(df.columns)):
        df_copy[df.columns[i]]=df_copy[df.columns[i]].str.extract('(\d+[.]?\d*)', expand=False) #replace(r'[^0-9]+','')
    new_df_arr.append(df_copy)
# check df's
for df in new_df_arr:
    print(df)

Upvotes: 3

Views: 203

Answers (2)

mLstudent33
mLstudent33

Reputation: 1175

I just came up with another way reusing the code posted in my question although not elegant as @David's answer:

# extract numbers
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

dfc = df.copy().astype(str)
for i in range(1, len(dfc.columns)):
    dfc[dfc.columns[i]]=dfc[dfc.columns[i]].str.extract('(\d+[.]?\d*)', expand=False) #replace(r'[^0-9]+','')
    if 'mg' in dfc.columns[i]:
        #val = pd.to_numeric(dfc[dfc.columns[i]], errors='coerce') * 1
        val = dfc[dfc.columns[i]].astype(float) * 1. 
        dfc[dfc.columns[i]] = val
    elif '\u03BCg' in dfc.columns[i]:
        #val = pd.to_numeric(dfc[dfc.columns[i]], errors='coerce') / 1000.
        val = dfc[dfc.columns[i]].astype(float) / 1000.
        dfc[dfc.columns[i]] = val
    elif 'g' in dfc.columns[i]:
        val = dfc[dfc.columns[i]].astype(float) * 1000.
        dfc[dfc.columns[i]] = val

#dfc = dfc.replace('NaN', 0)
dfc.fillna(0, inplace=True)
print(dfc)
print(dfc.dtypes)

The output is:

  Life-Stage Group  Arsenic  Boron (mg/d)  Calcium (mg/d)  Chromium  \
0         <= 3.0 y        0           3.0          2500.0         0   
1         <= 8.0 y        0           6.0          2500.0         0   

   Copper (μg/d)  Fluoride (mg/d)  Iodine (μg/d)  Iron (mg/d)  \
0            1.0              1.3            0.2         40.0   
1            3.0              2.2            0.3         40.0   

   Magnesium (mg/d)  Manganese (mg/d)  Molybdenum (μg/d)  Nickel (mg/d)  \
0              65.0               2.0                0.3            0.2   
1             110.0               3.0                0.6            0.3   

   Phosphorus (g/d)  Potassium  Selenium (μg/d)  Silicon  Sulfate  \
0            3000.0          0             0.09        0        0   
1            3000.0          0             0.15        0        0   

   Vanadium (mg/d)  Zinc (mg/d)  Sodium  Chloride (g/d)  
0              0.0          7.0       0          2300.0  
1              0.0         12.0       0          2900.0  
Life-Stage Group      object
Arsenic                int64
Boron (mg/d)         float64
Calcium (mg/d)       float64
Chromium               int64
Copper (μg/d)        float64
Fluoride (mg/d)      float64
Iodine (μg/d)        float64
Iron (mg/d)          float64
Magnesium (mg/d)     float64
Manganese (mg/d)     float64
Molybdenum (μg/d)    float64
Nickel (mg/d)        float64
Phosphorus (g/d)     float64
Potassium              int64
Selenium (μg/d)      float64
Silicon                int64
Sulfate                int64
Vanadium (mg/d)      float64
Zinc (mg/d)          float64
Sodium                 int64
Chloride (g/d)       float64
dtype: object

Upvotes: 0

David Erickson
David Erickson

Reputation: 16683

I used an input of just the first set of columns. You can:

  1. Loop through columns and create a series s that transforms the unit into what you want to multiply by mapping to a dictionary d
  2. Extract the digits and multiply by s for each column

df = pd.DataFrame({'Life-Stage Group': {0: '<= 3.0 y', 1: '<= 8.0 y'},
 'Arsenic': {0: 'nan g', 1: 'nan g'},
 'Boron (mg/d)': {0: '3 mg', 1: '6 mg'},
 'Calcium (mg/d)': {0: '2500 mg', 1: '2500 mg'},
 'Chromium': {0: 'nan g', 1: 'nan g'},
 'Copper (μg/d)': {0: '1000 μg', 1: '3000 μg'}})

d = {'μg' : .001, 'g' : 1000, 'mg' : 1}

for col in df.columns[1:]:
    s = df[col].str.split(' ').str[1].map(d).astype(float)
    df[col] = (df[col].str.extract('(\d+[.]?\d*)').astype(float) * s).fillna(0)
df
Out[1]: 
  Life-Stage Group  Arsenic  Boron (mg/d)  Calcium (mg/d)  Chromium  Copper (μg/d)
0         <= 3.0 y      0.0           3.0          2500.0       0.0            1.0
1         <= 8.0 y      0.0           6.0          2500.0       0.0            3.0     

Upvotes: 1

Related Questions