ZenenT

Reputation: 45

Plotting boolean frequency against qualitative data in pandas

I'll start off by saying that I'm not really talented in statistical analysis. I have a dataset stored in a .csv file that I'm looking to represent graphically. What I'm trying to represent is the frequency of survival (represented for each person as a 0 or 1 in the Survived column) for each unique entry in the other columns.

For example, one of the other columns, Class (Pclass in the data), holds one of three possible values (1, 2, or 3). I want to graph the probability that someone from Class 1 survives versus Class 2 versus Class 3, so that I can see whether class is correlated with survival rate.

I've attached the snippet of code that I've developed so far, but I'd understand if everything I'm doing is wrong because I've never used pandas before.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

print(list(df)[2:]) # slicing first 2 values of "ID" and "Survived"

for column in list(df)[2:]:
    try:
        df.plot(x='Survived', y=column, kind='hist')
    except TypeError:
        print("Column {} not usable.".format(column))

plt.show()

EDIT: I've attached a small segment of the dataframe below

     PassengerId  Survived  Pclass                                               Name  ...            Ticket      Fare        Cabin  Embarked  
0              1         0       3                            Braund, Mr. Owen Harris  ...         A/5 21171    7.2500          NaN         S  
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...          PC 17599   71.2833          C85         C  
2              3         1       3                             Heikkinen, Miss. Laina  ...  STON/O2. 3101282    7.9250          NaN         S  
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803   53.1000         C123         S  
4              5         0       3                           Allen, Mr. William Henry  ...            373450    8.0500          NaN         S  
5              6         0       3                                   Moran, Mr. James  ...            330877    8.4583          NaN         Q 

Upvotes: 0

Views: 486

Answers (2)

10101010

Reputation: 1821

Adding to the answer, here is a simple bar graph.

result = df.groupby('Pclass')['Survived'].mean()

result.plot(kind='bar', rot=0, ylim=(0, 1))


Upvotes: 1

gmds

Reputation: 19885

I think you want this:

df.groupby('Pclass')['Survived'].mean()

This separates the dataframe into three groups based on the three unique values of Pclass. It then takes the mean of Survived within each group, which is equal to the number of 1 values divided by the total number of values in that group. This would produce a Series (indexed by Pclass) looking something like this:

Pclass
1    0.558824
2    0.636364
3    0.696970

From there, you can plot a bar graph with .plot.bar() if you wish.
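To apply the same idea to more than one column, something like the following might work. It is only a sketch: the data is the six sample rows from the question (three columns only), and survival_rates and plot_survival_rates are hypothetical helper names made up for this example, not part of pandas.

```python
import pandas as pd

# The six sample rows shown in the question (only three of the columns).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})


def survival_rates(frame, columns, target="Survived"):
    """Hypothetical helper: map each column name to the mean of `target`
    within each unique value of that column."""
    return {col: frame.groupby(col)[target].mean() for col in columns}


def plot_survival_rates(rates):
    """Hypothetical helper: draw one bar chart per column."""
    # Imported here so the rates can be computed without matplotlib.
    import matplotlib.pyplot as plt

    for col, series in rates.items():
        fig, ax = plt.subplots()
        series.plot(kind="bar", ax=ax, rot=0, ylim=(0, 1),
                    title="Survival rate by {}".format(col))
        ax.set_ylabel("Fraction survived")
    plt.show()


rates = survival_rates(df, ["Pclass", "Embarked"])

# The mean of a 0/1 column is the number of 1s divided by the group size.
by_class = df.groupby("Pclass")["Survived"].agg(["sum", "count"])
by_class["rate"] = by_class["sum"] / by_class["count"]
print(by_class)

for col, series in rates.items():
    print(series)
```

Calling plot_survival_rates(rates) then opens one bar chart per column. With the full train.csv, columns that have many unique values (Name or Ticket, for instance) would give one bar per value, so it probably makes sense to pass only the low-cardinality columns.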

Upvotes: 1
