Reputation: 1175
I tried these: https://stackoverflow.com/a/37683738/13865853, https://stackoverflow.com/a/50830098/13865853.
My dataframe is all strings but the dtype is object for reasons I read elsewhere on SO.
The columns are units of micronutrients in foods that look like this:
Life-Stage Group Arsenic Boron (mg/d) Calcium (mg/d) Chromium Copper (μg/d) \
0 <= 3.0 y nan g 3 mg 2500 mg nan g 1000 μg
1 <= 8.0 y nan g 6 mg 2500 mg nan g 3000 μg
Fluoride (mg/d) Iodine (μg/d) Iron (mg/d) Magnesium (mg/d) Manganese (mg/d) \
0 1.3 mg 200 μg 40 mg 65 mg 2 mg
1 2.2 mg 300 μg 40 mg 110 mg 3 mg
Molybdenum (μg/d) Nickel (mg/d) Phosphorus (g/d) Potassium Selenium (μg/d) \
0 300 μg 0.2 mg 3 g nan g 90 μg
1 600 μg 0.3 mg 3 g nan g 150 μg
Silicon Sulfate Vanadium (mg/d) Zinc (mg/d) Sodium Chloride (g/d) \
0 nan g nan g nan mg 7 mg nan g 2.3 g
1 nan g nan g nan mg 12 mg nan g 2.9 g
Vitamin A (μg/d) Vitamin C (mg/d) Vitamin D (μg/d) Vitamin E (mg/d) \
0 600.0 μg 400 mg 63.0 μg 200 mg
1 900.0 μg 650 mg 75.0 μg 300 mg
Vitamin K (μg/d) Thiamin (mg/d) Riboflavin (mg/d) Niacin (mg/d) \
0 nan μg nan mg nan mg 10 mg
1 nan μg nan mg nan mg 15 mg
Vitamin B6 (mg/d) Folate (μg/d) Vitamin B12 (μg/d) Pantothenic Acid (mg/d) \
0 30 mg 300 μg nan μg nan mg
1 40 mg 400 μg nan μg nan mg
Biotin (μg/d) Choline (mg/d) Carotenoids
0 nan μg 1.0 mg nan g
1 nan μg 1.0 mg nan g
I want to zero-out nan
and just get the numerical values as I want to multiply g
by 1000 and divide any ug
(\u03BCg
in Python for micro) by 1000 so that everything is in mg
so I can plot them on a bar graph in Plotly Dash.
But I'm stuck at extracting numbers.
Previously when I was making csv files after downloading the data, this worked but it now does not:
# extract numbers
new_df_arr = []
for _,df in df_dict.items():
df = df.astype(str)
df_copy = df.copy()
for i in range(1, len(df.columns)):
df_copy[df.columns[i]]=df_copy[df.columns[i]].str.extract('(\d+[.]?\d*)', expand=False) #replace(r'[^0-9]+','')
new_df_arr.append(df_copy)
# check df's
for df in new_df_arr:
print(df)
Upvotes: 3
Views: 203
Reputation: 1175
I just came up with another way reusing the code posted in my question although not elegant as @David's answer:
# extract numbers
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
dfc = df.copy().astype(str)
for i in range(1, len(dfc.columns)):
dfc[dfc.columns[i]]=dfc[dfc.columns[i]].str.extract('(\d+[.]?\d*)', expand=False) #replace(r'[^0-9]+','')
if 'mg' in dfc.columns[i]:
#val = pd.to_numeric(dfc[dfc.columns[i]], errors='coerce') * 1
val = dfc[dfc.columns[i]].astype(float) * 1.
dfc[dfc.columns[i]] = val
elif '\u03BCg' in dfc.columns[i]:
#val = pd.to_numeric(dfc[dfc.columns[i]], errors='coerce') / 1000.
val = dfc[dfc.columns[i]].astype(float) / 1000.
dfc[dfc.columns[i]] = val
elif 'g' in dfc.columns[i]:
val = dfc[dfc.columns[i]].astype(float) * 1000.
dfc[dfc.columns[i]] = val
#dfc = dfc.replace('NaN', 0)
dfc.fillna(0, inplace=True)
print(dfc)
print(dfc.dtypes)
The output is:
Life-Stage Group Arsenic Boron (mg/d) Calcium (mg/d) Chromium \
0 <= 3.0 y 0 3.0 2500.0 0
1 <= 8.0 y 0 6.0 2500.0 0
Copper (μg/d) Fluoride (mg/d) Iodine (μg/d) Iron (mg/d) \
0 1.0 1.3 0.2 40.0
1 3.0 2.2 0.3 40.0
Magnesium (mg/d) Manganese (mg/d) Molybdenum (μg/d) Nickel (mg/d) \
0 65.0 2.0 0.3 0.2
1 110.0 3.0 0.6 0.3
Phosphorus (g/d) Potassium Selenium (μg/d) Silicon Sulfate \
0 3000.0 0 0.09 0 0
1 3000.0 0 0.15 0 0
Vanadium (mg/d) Zinc (mg/d) Sodium Chloride (g/d)
0 0.0 7.0 0 2300.0
1 0.0 12.0 0 2900.0
Life-Stage Group object
Arsenic int64
Boron (mg/d) float64
Calcium (mg/d) float64
Chromium int64
Copper (μg/d) float64
Fluoride (mg/d) float64
Iodine (μg/d) float64
Iron (mg/d) float64
Magnesium (mg/d) float64
Manganese (mg/d) float64
Molybdenum (μg/d) float64
Nickel (mg/d) float64
Phosphorus (g/d) float64
Potassium int64
Selenium (μg/d) float64
Silicon int64
Sulfate int64
Vanadium (mg/d) float64
Zinc (mg/d) float64
Sodium int64
Chloride (g/d) float64
dtype: object
Upvotes: 0
Reputation: 16683
I used an input of just the first set of columns. You can:
s
that transforms the unit into what you want to multiply by mapping to a dictionary d
s
for each columndf = pd.DataFrame({'Life-Stage Group': {0: '<= 3.0 y', 1: '<= 8.0 y'},
'Arsenic': {0: 'nan g', 1: 'nan g'},
'Boron (mg/d)': {0: '3 mg', 1: '6 mg'},
'Calcium (mg/d)': {0: '2500 mg', 1: '2500 mg'},
'Chromium': {0: 'nan g', 1: 'nan g'},
'Copper (μg/d)': {0: '1000 μg', 1: '3000 μg'}})
d = {'μg' : .001, 'g' : 1000, 'mg' : 1}
for col in df.columns[1:]:
s = df[col].str.split(' ').str[1].map(d).astype(float)
df[col] = (df[col].str.extract('(\d+[.]?\d*)').astype(float) * s).fillna(0)
df
Out[1]:
Life-Stage Group Arsenic Boron (mg/d) Calcium (mg/d) Chromium Copper (μg/d)
0 <= 3.0 y 0.0 3.0 2500.0 0.0 1.0
1 <= 8.0 y 0.0 6.0 2500.0 0.0 3.0
Upvotes: 1