Pandas: how to extract a mix of ints and floats from DataFrame columns

Question

I tried these: https://stackoverflow.com/a/37683738/13865853, https://stackoverflow.com/a/50830098/13865853.

My DataFrame contains only strings; the dtype is object, which (as I read elsewhere on SO) is how pandas stores strings.

The columns hold micronutrient amounts with their units, and look like this:

  Life-Stage Group Arsenic Boron (mg/d) Calcium (mg/d) Chromium Copper (μg/d)  \
0         <= 3.0 y   nan g         3 mg        2500 mg    nan g       1000 μg   
1         <= 8.0 y   nan g         6 mg        2500 mg    nan g       3000 μg   

  Fluoride (mg/d) Iodine (μg/d) Iron (mg/d) Magnesium (mg/d) Manganese (mg/d)  \
0          1.3 mg        200 μg       40 mg            65 mg             2 mg   
1          2.2 mg        300 μg       40 mg           110 mg             3 mg   

  Molybdenum (μg/d) Nickel (mg/d) Phosphorus (g/d) Potassium Selenium (μg/d)  \
0            300 μg        0.2 mg              3 g     nan g           90 μg   
1            600 μg        0.3 mg              3 g     nan g          150 μg   

  Silicon Sulfate Vanadium (mg/d) Zinc (mg/d) Sodium Chloride (g/d)  \
0   nan g   nan g          nan mg        7 mg  nan g          2.3 g   
1   nan g   nan g          nan mg       12 mg  nan g          2.9 g   

  Vitamin A (μg/d) Vitamin C (mg/d) Vitamin D (μg/d) Vitamin E (mg/d)  \
0         600.0 μg           400 mg          63.0 μg           200 mg   
1         900.0 μg           650 mg          75.0 μg           300 mg   

  Vitamin K (μg/d) Thiamin (mg/d) Riboflavin (mg/d) Niacin (mg/d)  \
0           nan μg         nan mg            nan mg         10 mg   
1           nan μg         nan mg            nan mg         15 mg   

  Vitamin B6 (mg/d) Folate (μg/d) Vitamin B12 (μg/d) Pantothenic Acid (mg/d)  \
0             30 mg        300 μg             nan μg                  nan mg   
1             40 mg        400 μg             nan μg                  nan mg   

  Biotin (μg/d) Choline (mg/d) Carotenoids  
0        nan μg         1.0 mg       nan g  
1        nan μg         1.0 mg       nan g

I want to zero out the nan values and extract just the numbers, so that I can multiply the g values by 1000 and divide the μg values (\u03BCg in Python, for micro) by 1000. That way everything is in mg and I can plot it on a bar graph in Plotly Dash.
But I'm stuck at extracting the numbers. The code below worked previously, when I was making CSV files after downloading the data, but it no longer does:

# extract numbers
new_df_arr = []
for _,df in df_dict.items():
    df = df.astype(str)
    df_copy = df.copy()
    for i in range(1, len(df.columns)):
        df_copy[df.columns[i]] = df_copy[df.columns[i]].str.extract(r'(\d+[.]?\d*)', expand=False) #replace(r'[^0-9]+','')
    new_df_arr.append(df_copy)
# check df's
for df in new_df_arr:
    print(df)

David Erickson · Accepted Answer

I used just the first set of columns as input. You can:

1. Loop through the columns and, for each one, build a series s of multipliers by splitting off the unit and mapping it through a dictionary d.
2. Extract the number from each value and multiply it by s.

import pandas as pd

df = pd.DataFrame({'Life-Stage Group': {0: '<= 3.0 y', 1: '<= 8.0 y'},
 'Arsenic': {0: 'nan g', 1: 'nan g'},
 'Boron (mg/d)': {0: '3 mg', 1: '6 mg'},
 'Calcium (mg/d)': {0: '2500 mg', 1: '2500 mg'},
 'Chromium': {0: 'nan g', 1: 'nan g'},
 'Copper (μg/d)': {0: '1000 μg', 1: '3000 μg'}})

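# multipliers that convert each unit to mg
d = {'μg' : .001, 'g' : 1000, 'mg' : 1}

for col in df.columns[1:]:
    # the unit is the part after the space, e.g. '3 mg' -> 'mg' -> 1
    s = df[col].str.split(' ').str[1].map(d).astype(float)
    # expand=False returns a Series, so the product aligns row by row with s;
    # 'nan g' has no digits, so it extracts as NaN and is filled with 0
    df[col] = (df[col].str.extract(r'(\d+[.]?\d*)', expand=False).astype(float) * s).fillna(0)
df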
Out[1]: 
  Life-Stage Group  Arsenic  Boron (mg/d)  Calcium (mg/d)  Chromium  Copper (μg/d)
0         <= 3.0 y      0.0           3.0          2500.0       0.0            1.0
1         <= 8.0 y      0.0           6.0          2500.0       0.0            3.0
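
A possible variation on the same idea, as a rough sketch rather than a drop-in replacement: capture the number and the unit with a single regex instead of splitting on the space. The helper name to_mg and the lookup table TO_MG are made up for this illustration. The table lists both the Greek mu (\u03BC) and the micro sign (\u00B5), on the assumption that scraped data may contain either character.

```python
import pandas as pd

# Multipliers that convert each unit to milligrams (illustrative table).
# Both Greek mu (U+03BC) and the micro sign (U+00B5) are included as an
# assumption about what scraped data might contain.
TO_MG = {"\u03bcg": 0.001, "\u00b5g": 0.001, "mg": 1.0, "g": 1000.0}

def to_mg(col: pd.Series) -> pd.Series:
    """Hypothetical helper: parse strings like '2.3 g' into milligrams."""
    # Named groups give a two-column frame: 'value' and 'unit'.
    # Strings without digits (e.g. 'nan g') do not match and yield NaN.
    parts = col.str.extract(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)")
    value = parts["value"].astype(float)
    factor = parts["unit"].map(TO_MG)
    # Missing values and unrecognised units both end up as 0 here.
    return (value * factor).fillna(0)

df = pd.DataFrame({
    "Life-Stage Group": ["<= 3.0 y", "<= 8.0 y"],
    "Arsenic": ["nan g", "nan g"],
    "Copper": ["1000 \u03bcg", "3000 \u00b5g"],
    "Chloride": ["2.3 g", "2.9 g"],
})

for col in df.columns[1:]:
    df[col] = to_mg(df[col])
print(df)
```

Note that fillna(0) also hides any unit missing from the table, so if that matters, check for NaN in the mapped factors before filling.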
