Pandas Dataframes

Question

I am still new to python and need help setting up a dataframe. I have a data set with 5 competitors, 3 possible sentiments and a number called Adjusted Count. Here is a sample data file: https://www.sendspace.com/file/0xfh5s

I need to create a dataframe which has competitors by sentiment and sum of Adjusted Count.

It should look like this:

            Positive    Balanced    Negative    Grand Total
A           335         208         33          576
B           346         170         97          613
C           573         90          454         1117
D           1955        1456        1762        5173
E           253         0           189         442
Grand Total 3462        1924        2535        7921

But instead this is what i'm currently getting:

            Positive    Balanced    Negative    Total
A           0           0           0           0
B           0           0           0           0
C           0           0           0           0
D           0           0           0           0
E           0           0           0           0

Here is my current code:

PATH_DATA = '...filename.xlsx'

df1 = pd.read_excel(PATH_DATA, 'A')
df2 = pd.read_excel(PATH_DATA, 'B')
df3 = pd.read_excel(PATH_DATA, 'C')
df4 = pd.read_excel(PATH_DATA, 'D')
df5 = pd.read_excel(PATH_DATA, 'E')
df_ALL = pd.concat([df1, df2, df3, df4, df5]) 

SENTIMENT_DIMENTION = 'Sentiment simplified'
SENTIMENT_ORDER = ['Positive', 'Balanced', 'Negative']
COMPETITOR_DIMENTION = 'Competitor Name'
COMPETITOR_ORDER = ['A', 'B', 'C', 'D', 'E']
AC_DIMENTION = 'Adjusted Count'

class Report:
    def ACbysentiment(self, vdimention, hdimention, indexlist):
        data = self.data
        data = data.groupby([vdimention, hdimention]).size().unstack(1)
        #data = data.groupby('Adjusted Count')[sumdimention].agg('sum')
        data = data.reindex(columns=indexlist)
        data['Total'] = data.sum(axis=1)
        data.fillna(0, inplace = True)
        data = data.sort_values('Total', ascending=True)
        return data

df_ALL = df_ALL.replace(np.nan, NAN_VAL, regex=True)
S1 = Report(df_ALL)

S1_Competitor_Sentiment_AC = S1.ACbysentiment(COMPETITOR_DIMENTION, AC_DIMENTION, SENTIMENT_ORDER)
S1_Competitor_Sentiment_AC = pd.DataFrame(S1_Competitor_Sentiment_AC, index = COMPETITOR_ORDER)

Any help you could give would be greatly appreciated!

Thanks

Here is a sample of what df_ALL could look like:

Competitor  Sentiment simplified    Adjusted Count
A           Positive                50
A           Balanced                40
A           Negative                30
A           Positive                10
B           Balanced                50
B           Negative                40
B           Positive                30
B           Balanced                10
B           Negative                50
B           Positive                40
C           Balanced                30
C           Negative                10
C           Positive                50
C           Balanced                40
C           Negative                30
D           Positive                10
D           Balanced                50
D           Negative                40
E           Positive                30
E           Balanced                10
E           Negative                50
E           Positive                40
E           Balanced                30
E           Negative                10
E           Positive                50

iamjoebloggs · Accepted Answer

GroupBy is your friend here. I didn't find any links to the excel file. I am assuming that you have a sheet for competitor with the sentiment information (i guess 1's and 0's). First thing you need to do is add the competitor name. You can do this in 2 ways, either add it in the excel sheet itself, or after you have read the excel sheet like this:

df1 = pd.read_excel(PATH_DATA, 'A')
df1['competitor'] = 'A'

This will add a column called competitor with value 'A'. In your case the competitor value is essential because you will aggregate on this value. Once you have done this for all the data frames only then you should concatenate them:

df_ALL = pd.concat([df1, df2, df3, df4, df5])

After this you can apply groupby on the dataframe:

df_grouped = df_All.groupby(by='competitor',as_index = False).sum()

EDIT: Class based solution

class Report:
    def __init__(self,dataframe):
        self.data = dataframe
    def ACbysentiment(self, vdimention, hdimention):
        data = self.data
        data = data.groupby(by=[vdimention, hdimention],as_index=False).sum()
        data = data.pivot(index='Competitor',columns = 'Sentiment simplified',values='Adjusted Count')
        data['Total'] = data.sum(axis=1) 
        #data.fillna(0, inplace = True) not required as we are aggregating
        data = data.sort_values('Total', ascending=True)
        return data

Note:The init method is required to pass the dataframe to the class. Your current code throws an exception.

Also note that i have used groupby with as_index = False. This allows a sql style groupby that you are looking for. So you can replace:

data = data.groupby([vdimention, hdimention]).size().unstack(1)

with

data = data.groupby(by=[vdimention, hdimention],as_index=False).sum()

Finally, if I understand correctly, you want to (in excel terms) pivot the data so for that we use the df.Pivot() method:

data = data.pivot(index='Competitor',columns = 'Sentiment simplified',values='Adjusted Count')

I also think that you don't need the various 'dimensions' that you have defined (unless they are used elsewhere in your code). You just need two:

COMPETITOR_DIMENTION = 'Competitor' # just the name of the column,case sensitive
AC_DIMENTION = 'Sentiment simplified'

and then:

S1_Competitor_Sentiment_AC = S1.ACbysentiment(COMPETITOR_DIMENTION, AC_DIMENTION)

This will return a pandas dataframe with your desired output.

Hope that helps!

Pandas Dataframes

Answers (1)

Related Questions