John Laudun

Reputation: 407

How to normalize columns in a dataframe

I have a pandas dataframe that holds the term frequencies for a corpus, with the terms as rows and the years as columns, like so:

|       | term    |   2002 |   2003 |   2004 |   2005 |
|------:|:--------|-------:|-------:|-------:|-------:|
|  3708 | climate |      1 |     10 |      1 |     14 |
|  8518 | global  |     12 |     11 |      2 |     12 |
| 13276 | nuclear |     10 |      1 |      0 |      4 |

I would like to normalize the values for each word by dividing them by the total number of words for a given year. Some years contain twice as many texts, so I am trying to scale by year (like Google Books). I have looked at examples of how to scale a single column, à la Chris Albon, and I have seen examples here on SO for scaling all the columns, but every time I try to convert this dataframe to an array to scale it, things choke on the fact that the term column isn't numeric. (I tried setting the term column as the index, but that didn't go well.) I can imagine a way to do this with a for loop, but almost every example of clean pandas code I read says not to use for loops because there's a pandas way of doing, well, everything.

What I would like is some way of saying:

for these columns [the years]:
    divide each value by the sum of its column

That's it.

Upvotes: 0

Views: 6682

Answers (2)

Balaji Ambresh

Reputation: 5037

Try:

In [5]: %paste
cols = ['2002', '2003', '2004', '2005']
df[cols] = df[cols] / df[cols].sum()

## -- End pasted text --

In [6]: df
Out[6]:
      term      2002      2003      2004      2005
0  climate  0.043478  0.454545  0.333333  0.466667
1   global  0.521739  0.500000  0.666667  0.400000
2  nuclear  0.434783  0.045455  0.000000  0.133333
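
For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. It assumes the year column labels are strings (if they are integers in your data, use `[2002, 2003, ...]` instead), and the names `year_cols` and `column_totals` are only illustrative. `sum()` defaults to summing down each column, and the division then lines up on column labels, so each value is divided by its own year's total.

```python
import pandas as pd

# Sample data copied from the question (three terms only).
df = pd.DataFrame({
    "term": ["climate", "global", "nuclear"],
    "2002": [1, 12, 10],
    "2003": [10, 11, 1],
    "2004": [1, 2, 0],
    "2005": [14, 12, 4],
})

# Assumption: the year labels are strings, as in the answer above.
year_cols = ["2002", "2003", "2004", "2005"]

# One total per year; sum() works column-wise by default.
column_totals = df[year_cols].sum()

# Division aligns on column labels, so every column is divided
# by its own total. The non-numeric "term" column is never touched.
df[year_cols] = df[year_cols] / column_totals

print(df)
```

After this runs, each year column should sum to 1, which is a quick way to check that the scaling was done per year.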

Upvotes: 4

kait

Reputation: 1357

Try this:

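import pandas as pd

df = pd.DataFrame(
    columns=['term', '2002', '2003', '2004', '2005'],
    data=[['climate', 1, 10, 1, 14],
          ['global', 12, 11, 2, 12],
          ['nuclear', 10, 1, 0, 4], ])
# Pick out the numeric columns ('number' also matches int64 on platforms
# where 'int' means int32) and divide each one by its own total.
normalized = df.select_dtypes('number').apply(lambda x: x / x.sum())
# Merge on the row index; overlapping year columns from the right-hand
# frame get the '_norm' suffix, so the raw counts are kept alongside.
df = df.merge(
    right=normalized,
    left_index=True,
    right_index=True,
    suffixes=('', '_norm')
)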

Returns

      term  2002  2003  2004  2005  2002_norm  2003_norm  2004_norm  2005_norm
0  climate     1    10     1    14   0.043478   0.454545   0.333333   0.466667
1   global    12    11     2    12   0.521739   0.500000   0.666667   0.400000
2  nuclear    10     1     0     4   0.434783   0.045455   0.000000   0.133333
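
Since the question mentions that setting the term column as the index did not go well, here is one possible sketch of that route. It is a variation, not part of the answer above: the names `counts`, `shares`, and `combined` are made up for illustration. With `term` as the index, every remaining column is numeric, so the whole frame can be divided by its column sums, and `add_suffix` plus `join` keeps the raw counts next to the normalized values.

```python
import pandas as pd

df = pd.DataFrame(
    columns=["term", "2002", "2003", "2004", "2005"],
    data=[["climate", 1, 10, 1, 14],
          ["global", 12, 11, 2, 12],
          ["nuclear", 10, 1, 0, 4]],
)

# Move the text column into the index so only numeric columns remain.
counts = df.set_index("term")

# Divide every column by its own total (column-wise sum by default).
shares = counts / counts.sum()

# Rename the normalized columns and attach them to the raw counts;
# join aligns the two frames on the shared "term" index.
combined = counts.join(shares.add_suffix("_norm"))

print(combined)
```

Calling `combined.reset_index()` afterwards turns `term` back into an ordinary column if that layout is preferred.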

Upvotes: 2
