Reputation: 66
I am still new to python and need help setting up a dataframe. I have a data set with 5 competitors, 3 possible sentiments and a number called Adjusted Count. Here is a sample data file: https://www.sendspace.com/file/0xfh5s
I need to create a dataframe which has competitors by sentiment and sum of Adjusted Count.
It should look like this:
Positive Balanced Negative Grand Total
A 335 208 33 576
B 346 170 97 613
C 573 90 454 1117
D 1955 1456 1762 5173
E 253 0 189 442
Grand Total 3462 1924 2535 7921
But instead this is what i'm currently getting:
Positive Balanced Negative Total
A 0 0 0 0
B 0 0 0 0
C 0 0 0 0
D 0 0 0 0
E 0 0 0 0
Here is my current code:
PATH_DATA = '...filename.xlsx'
df1 = pd.read_excel(PATH_DATA, 'A')
df2 = pd.read_excel(PATH_DATA, 'B')
df3 = pd.read_excel(PATH_DATA, 'C')
df4 = pd.read_excel(PATH_DATA, 'D')
df5 = pd.read_excel(PATH_DATA, 'E')
df_ALL = pd.concat([df1, df2, df3, df4, df5])
SENTIMENT_DIMENTION = 'Sentiment simplified'
SENTIMENT_ORDER = ['Positive', 'Balanced', 'Negative']
COMPETITOR_DIMENTION = 'Competitor Name'
COMPETITOR_ORDER = ['A', 'B', 'C', 'D', 'E']
AC_DIMENTION = 'Adjusted Count'
class Report:
def ACbysentiment(self, vdimention, hdimention, indexlist):
data = self.data
data = data.groupby([vdimention, hdimention]).size().unstack(1)
#data = data.groupby('Adjusted Count')[sumdimention].agg('sum')
data = data.reindex(columns=indexlist)
data['Total'] = data.sum(axis=1)
data.fillna(0, inplace = True)
data = data.sort_values('Total', ascending=True)
return data
df_ALL = df_ALL.replace(np.nan, NAN_VAL, regex=True)
S1 = Report(df_ALL)
S1_Competitor_Sentiment_AC = S1.ACbysentiment(COMPETITOR_DIMENTION, AC_DIMENTION, SENTIMENT_ORDER)
S1_Competitor_Sentiment_AC = pd.DataFrame(S1_Competitor_Sentiment_AC, index = COMPETITOR_ORDER)
Any help you could give would be greatly appreciated!
Thanks
Here is a sample of what df_ALL could look like:
Competitor Sentiment simplified Adjusted Count
A Positive 50
A Balanced 40
A Negative 30
A Positive 10
B Balanced 50
B Negative 40
B Positive 30
B Balanced 10
B Negative 50
B Positive 40
C Balanced 30
C Negative 10
C Positive 50
C Balanced 40
C Negative 30
D Positive 10
D Balanced 50
D Negative 40
E Positive 30
E Balanced 10
E Negative 50
E Positive 40
E Balanced 30
E Negative 10
E Positive 50
Upvotes: 0
Views: 95
Reputation: 46
GroupBy is your friend here. I didn't find any links to the excel file. I am assuming that you have a sheet for competitor with the sentiment information (i guess 1's and 0's). First thing you need to do is add the competitor name. You can do this in 2 ways, either add it in the excel sheet itself, or after you have read the excel sheet like this:
df1 = pd.read_excel(PATH_DATA, 'A')
df1['competitor'] = 'A'
This will add a column called competitor with value 'A'. In your case the competitor value is essential because you will aggregate on this value. Once you have done this for all the data frames only then you should concatenate them:
df_ALL = pd.concat([df1, df2, df3, df4, df5])
After this you can apply groupby on the dataframe:
df_grouped = df_All.groupby(by='competitor',as_index = False).sum()
EDIT: Class based solution
class Report:
def __init__(self,dataframe):
self.data = dataframe
def ACbysentiment(self, vdimention, hdimention):
data = self.data
data = data.groupby(by=[vdimention, hdimention],as_index=False).sum()
data = data.pivot(index='Competitor',columns = 'Sentiment simplified',values='Adjusted Count')
data['Total'] = data.sum(axis=1)
#data.fillna(0, inplace = True) not required as we are aggregating
data = data.sort_values('Total', ascending=True)
return data
Note:The init method is required to pass the dataframe to the class. Your current code throws an exception.
Also note that i have used groupby with as_index = False. This allows a sql style groupby that you are looking for. So you can replace:
data = data.groupby([vdimention, hdimention]).size().unstack(1)
with
data = data.groupby(by=[vdimention, hdimention],as_index=False).sum()
Finally, if I understand correctly, you want to (in excel terms) pivot the data so for that we use the df.Pivot() method:
data = data.pivot(index='Competitor',columns = 'Sentiment simplified',values='Adjusted Count')
I also think that you don't need the various 'dimensions' that you have defined (unless they are used elsewhere in your code). You just need two:
COMPETITOR_DIMENTION = 'Competitor' # just the name of the column,case sensitive
AC_DIMENTION = 'Sentiment simplified'
and then:
S1_Competitor_Sentiment_AC = S1.ACbysentiment(COMPETITOR_DIMENTION, AC_DIMENTION)
This will return a pandas dataframe with your desired output.
Hope that helps!
Upvotes: 1