Reputation: 61
I'm trying to iterate through column headers and populate a list containing the mean revenue for each category. I'm getting this error and I tried a bunch of different things trying to fix it. So my dataframe looks like this (looks like I dont have enough points to paste an image, but here is a link to it): each row in the dataframe represent an idividual movie. column revenue_adj populated with revenue values, one column for each genre populated with TRUE/FALSE indicating that the movie falls under that specific genre.
this code works fine, but I would like to do it in a for loop insteadgenrev1
genrev1= df.query('action == True')['revenue_adj'].mean()
genrev2 = df.query('adventure == True')['revenue_adj'].mean()
genrev3 = df.query('animation == True')['revenue_adj'].mean()
genrev4 = df.query('comedy == True')['revenue_adj'].mean()
genrev5 = df.query('crime == True')['revenue_adj'].mean()
genrev6 = df.query('documentary == True')['revenue_adj'].mean()
genrev7 = df.query('drama == True')['revenue_adj'].mean()
genrev8 = df.query('family == True')['revenue_adj'].mean()
I tried the following loop without success:
genheads = ['action', 'adventure', 'animation', 'comedy', 'crime', 'documentary', 'drama', 'family']
genres2 = genheads
genrev = []
for gen in genres2:
genrev.append(df.query('gen == True')['revenue_adj'].mean())
I get an error saying "gen is not defined" Instead of gen, I have tried:
df[gen] == True
'@df[gen] == True'
'@gen == True'
I been stuck on this for several weeks and any help on this would be soooo appreciated!!
Upvotes: 5
Views: 18538
Reputation: 1586
Welcome to stackoverflow!
I don't really see the need for the loop over the columns. Loops are typically an inefficient way to work on pandas dataframes and should be avoided if possible. With that in mind, I'll offer up a different solution. Apologies if you really DO need to loop over the columns. If you offer up a little more context in your question about why you've chosen the approach you have... where you've looked already and why that hasn't worked... it often helps guide people better to giving you more relevant answers.
Here's how I'd do it... without a loop.
import pandas as pd
import numpy as np
# mimick something similar to your data
arr = [
[1.2, True, False, True],
[2.3, False, True, True],
[3.4, True, True, False]
]
genres = ['action', 'adventure', 'comedy']
df = pd.DataFrame(arr, columns=['rev'] + genres)
# perform your task
result = df.loc[:, genres] # take just the genre columns
result = result.astype('int') # convert boolean to int
result[result == 0] = np.nan # convert 0's to nulls so they're excluded from the avg
result = result.mul(df.rev, axis=0) # multiply by your 'rev' column
result = result.mean(axis=0) # calc mean for all genres simultaneously
result
# action 2.30
# adventure 2.85
# comedy 1.75
Upvotes: 1
Reputation: 167
I've set up a basic example with two genres. I created a list, genre_mean, that will store the genre and its average revenue as a tuple. If you're just looking to store the average revenue you can adjust the genre_mean.append() statement.
movies = pd.DataFrame({"adj_rev": [10,20,30],
"action": ["TRUE", "FALSE", "TRUE"],
"comedy": ["FALSE", "TRUE", "FALSE"]})
genres = ["action", "comedy"]
genre_mean = []
for g in genres:
g_mean = movies["adj_rev"][movies[g]=="TRUE"].mean()
genre_mean.append((g, g_mean))
and the results
genre_mean
[('action', 20.0), ('comedy', 20.0)]
Upvotes: 0
Reputation: 153500
You can try something like this instead:
df = pd.DataFrame({'Revenue':np.arange(100,1001,100),
'action':np.random.choice([True, False],10),
'comedy':np.random.choice([True, False],10),
'drama':np.random.choice([True, False],10)})
df.iloc[:,1:].apply(lambda x: pd.Series(df.loc[x,'Revenue'])).mean()
Output:
action 500.0
comedy 587.5
drama 420.0
dtype: float64
Upvotes: 0
Reputation: 1821
You are trying to use the variable gen in the string, but you can't do it that way.
One fix would be changing the line to:
genrev.append(df.query(f'{gen} == True')['revenue_adj'].mean())
given that you are on python 3.6 or above. Older versions you could do:
genrev.append(df.query('%s == True' % gen)['revenue_adj'].mean())
You can also iterate over the columns directly instead of manually building the list:
for col in df.columns:
...
Upvotes: 2