Andres M
Andres M

Reputation: 61

How to iterate through column headers in pandas

I'm trying to iterate through column headers and populate a list containing the mean revenue for each category. I'm getting this error and I tried a bunch of different things trying to fix it. So my dataframe looks like this (looks like I dont have enough points to paste an image, but here is a link to it): each row in the dataframe represent an idividual movie. column revenue_adj populated with revenue values, one column for each genre populated with TRUE/FALSE indicating that the movie falls under that specific genre.

this code works fine, but I would like to do it in a for loop insteadgenrev1

genrev1= df.query('action == True')['revenue_adj'].mean()
genrev2 = df.query('adventure == True')['revenue_adj'].mean()
genrev3 = df.query('animation == True')['revenue_adj'].mean()
genrev4 = df.query('comedy == True')['revenue_adj'].mean()
genrev5 = df.query('crime == True')['revenue_adj'].mean()
genrev6 = df.query('documentary == True')['revenue_adj'].mean()
genrev7 = df.query('drama == True')['revenue_adj'].mean()
genrev8 = df.query('family == True')['revenue_adj'].mean()

I tried the following loop without success:

genheads = ['action', 'adventure', 'animation', 'comedy', 'crime', 'documentary', 'drama', 'family']

genres2 = genheads
genrev = []
for gen in genres2:
    genrev.append(df.query('gen == True')['revenue_adj'].mean())

I get an error saying "gen is not defined" Instead of gen, I have tried:

df[gen] == True
'@df[gen] == True'
'@gen == True'

I been stuck on this for several weeks and any help on this would be soooo appreciated!!

Upvotes: 5

Views: 18538

Answers (5)

Ricardo del VBA
Ricardo del VBA

Reputation: 11

for column in df:
         print(df[column])

Upvotes: 1

kmh
kmh

Reputation: 1586

Welcome to stackoverflow!

I don't really see the need for the loop over the columns. Loops are typically an inefficient way to work on pandas dataframes and should be avoided if possible. With that in mind, I'll offer up a different solution. Apologies if you really DO need to loop over the columns. If you offer up a little more context in your question about why you've chosen the approach you have... where you've looked already and why that hasn't worked... it often helps guide people better to giving you more relevant answers.

Here's how I'd do it... without a loop.

import pandas as pd
import numpy as np

# mimick something similar to your data
arr = [
    [1.2, True, False, True],
    [2.3, False, True, True],
    [3.4, True, True, False]
]
genres = ['action', 'adventure', 'comedy']
df = pd.DataFrame(arr, columns=['rev'] + genres)

# perform your task
result = df.loc[:, genres]            # take just the genre columns
result = result.astype('int')         # convert boolean to int
result[result == 0] = np.nan          # convert 0's to nulls so they're excluded from the avg
result = result.mul(df.rev, axis=0)   # multiply by your 'rev' column
result = result.mean(axis=0)          # calc mean for all genres simultaneously

result
# action       2.30
# adventure    2.85
# comedy       1.75

Upvotes: 1

CGul
CGul

Reputation: 167

I've set up a basic example with two genres. I created a list, genre_mean, that will store the genre and its average revenue as a tuple. If you're just looking to store the average revenue you can adjust the genre_mean.append() statement.

movies = pd.DataFrame({"adj_rev": [10,20,30],
                       "action": ["TRUE", "FALSE", "TRUE"],
                       "comedy": ["FALSE", "TRUE", "FALSE"]})

genres = ["action", "comedy"]

genre_mean = []

for g in genres:
    g_mean = movies["adj_rev"][movies[g]=="TRUE"].mean()
    genre_mean.append((g, g_mean))

and the results

genre_mean
[('action', 20.0), ('comedy', 20.0)]

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153500

You can try something like this instead:

df = pd.DataFrame({'Revenue':np.arange(100,1001,100),
                   'action':np.random.choice([True, False],10),
                   'comedy':np.random.choice([True, False],10),
                   'drama':np.random.choice([True, False],10)})

df.iloc[:,1:].apply(lambda x: pd.Series(df.loc[x,'Revenue'])).mean()

Output:

action    500.0
comedy    587.5
drama     420.0
dtype: float64

Upvotes: 0

John Sloper
John Sloper

Reputation: 1821

You are trying to use the variable gen in the string, but you can't do it that way.

One fix would be changing the line to:

genrev.append(df.query(f'{gen} == True')['revenue_adj'].mean())

given that you are on python 3.6 or above. Older versions you could do:

genrev.append(df.query('%s == True' % gen)['revenue_adj'].mean())

You can also iterate over the columns directly instead of manually building the list:

for col in df.columns:
...

Upvotes: 2

Related Questions