Reputation: 43
Please help. This seems easy, but I just can't figure it out.
DataFrame (df) contains numbers. For each column:
* compute the mean and std
* compute a new value for each value in each row in each column
* replace the old value with the new value
Method 1
import numpy as np
import pandas as pd
n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n+1
Method 2
labels = df.columns.values.tolist()
df2 = df.ix[:,0]
n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    ls = []
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        ls.append(y)
    df2 = pd.DataFrame({labels[n]:str(ls)})
    df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
    n = n+1
Error: ValueError: If using all scalar values, you must pass an index
Also tried the .apply method but the new DataFrame doesn't change the values.
print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}
Upvotes: 0
Views: 1196
Reputation: 22453
It looks like you're trying to do operations on DataFrame columns and values as though DataFrames were simple lists or arrays, rather than in the vectorized / column-at-a-time way more usual for NumPy and Pandas work.
A simple, first-pass improvement might be:
# import your data (json_text is assumed to hold the JSON string from the question)
import json
import numpy as np
import pandas as pd
df = pd.DataFrame(json.loads(json_text))
# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)
That works one column at a time. It retains some explicit looping, but is straightforward and relatively beginner-friendly.
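As a minimal end-to-end sketch, the per-column loop can be run on two of the columns from the question's data. It uses whole-column arithmetic, which should give the same numbers as the per-value apply. The check at the end is added here for illustration only.

```python
import json

import numpy as np
import pandas as pd

# Two of the question's four columns, copied from its to_json() output.
json_text = """
{"col1": {"subj1": 4161.97, "subj2": 5794.73, "subj3": 4740.44,
          "subj4": 4702.84, "subj5": 3818.94},
 "col3": {"subj1": 270.7, "subj2": 322.607708, "subj3": 293.422314,
          "subj4": 208.644585, "subj5": 210.619961}}
"""
df = pd.DataFrame(json.loads(json_text))

for col in df.select_dtypes([np.number]):
    # Assigning the result back to df[col] is what changes the DataFrame;
    # computing the expression without assigning it leaves df untouched.
    df[col] = (df[col] - df[col].mean()) / df[col].std()

# Each column should now have mean ~0 and sample standard deviation ~1.
print(df.mean().round(6))
print(df.std().round(6))
```

The assignment back into df[col] is probably the piece missing from the .apply attempt mentioned in the question: apply returns a new object and does not modify the DataFrame in place.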
If you're comfortable with Pandas, it can be done more compactly:
numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
In your revised DataFrame, there are no string columns. But the earlier DataFrame had string columns, causing problems when they were computed upon, so let's be careful. This is a generic way to select numeric columns. If it's too much, you can simplify at the cost of generality by listing them explicitly:
numeric_cols = ['col1', 'col2', 'col3', 'col4']
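To illustrate why the generic selection is the safer choice, here is a small sketch on a made-up DataFrame. The "group" string column and all the values are hypothetical, invented only for this example. select_dtypes skips the string column, so only the numeric columns are standardized.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame; "group" is a string column added for illustration.
df = pd.DataFrame(
    {
        "group": ["a", "b", "a", "b"],
        "col1": [1.0, 2.0, 3.0, 4.0],
        "col2": [10.0, 20.0, 30.0, 40.0],
    }
)

# Only numeric columns are selected; the string column is left out.
numeric_cols = list(df.select_dtypes([np.number]))
print(numeric_cols)  # ['col1', 'col2']

# Standardize the numeric columns and assign the result back.
df[numeric_cols] = df[numeric_cols].apply(
    lambda col: (col - col.mean()) / col.std(), axis=0
)
print(df)
```

Running the same lambda over the whole frame, string column included, would fail on the subtraction, which is the kind of problem the selection step avoids.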
Upvotes: 0
Reputation: 857
You are standardizing each column: removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler for this:
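import pandas as pd
from sklearn import preprocessing
# StandardScaler standardizes each column (feature), so pass df without transposing
scaler = preprocessing.StandardScaler()
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)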
Here is the documentation for the same
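One detail worth checking when comparing results: StandardScaler divides by the population standard deviation (ddof=0), the same convention as np.std in the question, while pandas' .std() defaults to the sample standard deviation (ddof=1). The sketch below uses only pandas and NumPy (it does not call scikit-learn) on col1 from the question to show how the two conventions differ.

```python
import numpy as np
import pandas as pd

# col1 from the question's data.
col = pd.Series(
    [4161.97, 5794.73, 4740.44, 4702.84, 3818.94],
    index=["subj1", "subj2", "subj3", "subj4", "subj5"],
    name="col1",
)

# Population convention (ddof=0): matches np.std and StandardScaler.
z_population = (col - col.mean()) / col.std(ddof=0)
# Sample convention (ddof=1): the pandas default for .std().
z_sample = (col - col.mean()) / col.std()

# The two results differ only by a constant factor of sqrt(n / (n - 1)).
n = len(col)
ratio = z_population / z_sample
print(ratio.round(6))
print(round(float(np.sqrt(n / (n - 1))), 6))
```

With only five rows the factor is noticeable, so the pandas-based answers and the scikit-learn one will not give identical numbers unless ddof is set to match.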
Upvotes: 1