qode

Reputation: 43

Iterate and change value based on function in Python pandas

This seems easy, but I can't figure it out.
My DataFrame (df) contains numbers. For each column, I want to:
* compute the mean and std
* compute a new value for each value in that column
* replace the old value with the new value

Method 1

import numpy as np
import pandas as pd
n = 1
while n<len(df.column.values.tolist()):
    col = df.values[:,n]
    mean = sum(col)/len(col)
    std = np.std(col, axis = 0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n+1

Method 2

    labels = df.columns.values.tolist()
    df2 = df.ix[:,0]
    n = 1
    while n<len(df.column.values.tolist()):
        col = df.values[:,n]
        mean = sum(col)/len(col)
        std = np.std(col, axis = 0)
        ls = []
        for x in df[df.columns.values[n]]:
            y = (float(x) - float(mean)) / float(std)
            ls.append(y)
        df2 = pd.DataFrame({labels[n]:str(ls)})
        df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
        n = n+1

Error: ValueError: If using all scalar values, you must pass an index

Also tried the .apply method but the new DataFrame doesn't change the values.

print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}

Upvotes: 0

Views: 1196

Answers (2)

Jonathan Eunice

Reputation: 22453

It looks like you're trying to do operations on DataFrame columns and values as though DataFrames were simple lists or arrays, rather than in the vectorized / column-at-a-time way more usual for NumPy and Pandas work.
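As one concrete illustration, the ValueError reported for Method 2 appears to come from passing `str(ls)` to the DataFrame constructor: a string is a single scalar, and pandas cannot build rows from a dict of scalars without an index. The sketch below reproduces that with made-up values for `ls` (the numbers are placeholders, not results from your data) and shows that passing the list itself works:

```python
import pandas as pd

# Placeholder values standing in for the normalized numbers in `ls`.
ls = [0.1, -0.2, 0.3]

try:
    # str(ls) is one string, i.e. a scalar, so pandas has no row index to use.
    pd.DataFrame({"col2": str(ls)})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Passing the list itself gives one row per element.
fixed = pd.DataFrame({"col2": ls})
print(fixed)
```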

A simple, first-pass improvement might be:

# import your data (json_text holds the JSON string printed in the question)
import json
import numpy as np
import pandas as pd
df = pd.DataFrame(json.loads(json_text))

# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std  = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)

That processes one column at a time. It retains some explicit looping (over the columns, and over each value inside apply), but is straightforward and relatively beginner-friendly.

If you're comfortable with Pandas, it can be done more compactly:

numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
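The same normalization can also be sketched with no `apply` at all, since subtracting or dividing a DataFrame by a Series aligns the Series with the column labels. This example uses two of the columns from the question's JSON. Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), whereas `np.std` in the question defaults to `ddof=0`, so the two give slightly different numbers.

```python
import numpy as np
import pandas as pd

# Two columns taken from the JSON in the question.
df = pd.DataFrame(
    {
        "col1": [4161.97, 5794.73, 4740.44, 4702.84, 3818.94],
        "col3": [270.7, 322.607708, 293.422314, 208.644585, 210.619961],
    },
    index=["subj1", "subj2", "subj3", "subj4", "subj5"],
)

numeric_cols = list(df.select_dtypes(include=[np.number]))

means = df[numeric_cols].mean()  # one mean per column
stds = df[numeric_cols].std()    # sample std (ddof=1) per column

# Column-wise broadcasting: each column uses its own mean and std.
z = (df[numeric_cols] - means) / stds
print(z)
```

Assigning `df[numeric_cols] = z` would then overwrite the original values in place.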

In your revised DataFrame, there are no string columns. But the earlier DataFrame had string columns, which caused problems when they were computed upon, so let's be careful. The select_dtypes call is a generic way to select only the numeric columns. If it's too much, you can simplify at the cost of generality by listing them explicitly:

numeric_cols = ['col1', 'col2', 'col3', 'col4']

Upvotes: 0

Sriram Sitharaman

Reputation: 857

You are standardizing each column: removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler for this:

import pandas as pd
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
# StandardScaler scales each column (feature), so pass df itself, not df.T
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

See the scikit-learn documentation for StandardScaler for details.
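One detail worth checking: StandardScaler divides by the population standard deviation (equivalent to `np.std` with `ddof=0`), while pandas' `.std()` defaults to `ddof=1`. The sketch below uses only pandas and NumPy to mimic the population-std scaling on two columns from the question, and shows that the two conventions differ by a constant factor of sqrt(n / (n - 1)). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Two columns taken from the JSON in the question.
df = pd.DataFrame(
    {
        "col1": [4161.97, 5794.73, 4740.44, 4702.84, 3818.94],
        "col3": [270.7, 322.607708, 293.422314, 208.644585, 210.619961],
    },
    index=["subj1", "subj2", "subj3", "subj4", "subj5"],
)

n = len(df)

# Population std (ddof=0), the convention StandardScaler uses.
z_population = (df - df.mean()) / df.std(ddof=0)

# Sample std (ddof=1), the pandas default.
z_sample = (df - df.mean()) / df.std()

# The two results differ only by a constant factor.
factor = np.sqrt(n / (n - 1))
print(np.allclose(z_population, z_sample * factor))
```

With only five rows the factor is about 1.118, so the choice of convention is visible in the output. For large n it becomes negligible.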

Upvotes: 1
