Reputation: 393
I am trying to plot a line chart of a large data set where I want to set y to a "count" value.
This is a mock df:
import numpy as np
import pandas as pd

my = pd.DataFrame(np.array(
    [['Apple', 1],
     ['Kiwi', 2],
     ['Clementine', 3],
     ['Kiwi', 1],
     ['Banana', 2],
     ['Clementine', 3],
     ['Apple', 1],
     ['Kiwi', 2]]),
    columns=['fruit', 'cheers'])
I would like the plot to use 'cheers' as the x and then have one line for each 'fruit', with y being the number of times each 'cheers' value occurs for that fruit.
EDIT: Line graph might not be the best pursuit, please do advise me then. I would like something like this:
In the big data set there would maybe be one, but not several, "zeros"; maybe I should've made a bigger mock df.
Upvotes: 1
Views: 6440
Reputation: 39062
An alternative way to get exactly the figure you posted, with the curves starting from 0, is the following. The idea is to count how often each cheers value occurs for each fruit and then store the counts in dictionaries.
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Define the dataframe here
# my = pd.DataFrame(...)

# Note: 'cheers' holds strings here because the mock df was built from a 2D array
cheers = my['cheers'].to_numpy()
for fr in np.unique(my['fruit']):
    # Count how often each cheers value occurs for this fruit
    freqs = Counter(cheers[(my['fruit'] == fr).to_numpy()])
    init_dict = {'0': 0}  # Start every curve at zero
    init_dict.update({i: 0 for i in np.unique(cheers)})  # Initialize all cheers values with 0
    init_dict.update(freqs)  # Overwrite with the actual counts
    # Plot each fruit line; lists are used because dict views are not sequences
    plt.plot(list(init_dict.keys()), list(init_dict.values()), '-o', label=fr)
plt.legend()
plt.yticks(range(4))
plt.show()
Upvotes: 1
Reputation: 36
The code below will plot a line for each 'fruit', where the x coordinate is the number of 'cheers' and the y coordinate is the count of that cheers value per fruit.
First, the dataframe is grouped by fruit to get the list of cheers per fruit. Next, a histogram is computed and plotted for each list of cheers. The max_cheers_count is used in order to ensure the same x coordinates for all plotted lines.
Note: see @Heike's answer below for a more pythonic solution.
import matplotlib.pyplot as plt
import numpy as np

# convert 'cheers' column to int
my['cheers'] = my['cheers'].astype(int)
# compute the maximal cheers value, to use later for the histogram bins
max_cheers_count = my['cheers'].max()
# get the array of cheers values per fruit
cheer_counts = my.groupby('fruit').apply(lambda x: x['cheers'].values)
# for each fruit compute a histogram of cheer counts and plot it
plt.figure()
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for fruit, fruit_cheers in cheer_counts.items():
    counts, bin_edges = np.histogram(a=fruit_cheers, bins=range(1, max_cheers_count + 2))
    plt.plot(bin_edges[:-1], counts, marker='o', label=fruit)
plt.xlabel('cheers')
plt.ylabel('counts')
plt.legend()
plt.show()
Upvotes: 1
Reputation: 24420
I see you already accepted an answer, but an alternative way to do this is something like
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

my = pd.DataFrame(np.array([['Apple', 1],
                            ['Kiwi', 2],
                            ['Clementine', 3],
                            ['Kiwi', 1],
                            ['Banana', 2],
                            ['Clementine', 3],
                            ['Apple', 1],
                            ['Kiwi', 2]]),
                  columns=['fruit', 'cheers'])

my_pivot = my.pivot_table(index='cheers',
                          columns='fruit',
                          fill_value=0,
                          aggfunc={'fruit': len})['fruit']
my_pivot.plot.line()
plt.tight_layout()
plt.show()
Output:
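As a variation on the pivot table above, the same kind of count table can probably also be built with pd.crosstab. This is only a sketch, not the answer's original method: it assumes the dataframe is created from a dictionary so that 'cheers' is an integer column, and the extra zero row (added with reindex) is an assumption about how the curves should start.

```python
import pandas as pd

# Assumption: the mock data is built from a dict so 'cheers' is an int column
my = pd.DataFrame(
    {'fruit': ['Apple', 'Kiwi', 'Clementine', 'Kiwi',
               'Banana', 'Clementine', 'Apple', 'Kiwi'],
     'cheers': [1, 2, 3, 1, 2, 3, 1, 2]})

# Rows are cheers values, columns are fruits, cells are occurrence counts
counts = pd.crosstab(my['cheers'], my['fruit'])

# Add any missing cheers values (including 0) as rows filled with zeros,
# so that every line starts at zero on the x axis
counts = counts.reindex(range(0, counts.index.max() + 1), fill_value=0)

print(counts)
```

Calling counts.plot.line(marker='o') on the result should then draw one line per fruit, in the same way as my_pivot.plot.line() above.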
Upvotes: 2
Reputation: 10880
my.groupby('fruit').sum().plot.barh()
Note that your example dataframe appears to have the numbers represented as string type, so you might change that to int beforehand with
my.cheers = my.cheers.astype(int)
As far as I can see, this is because of your initialization of the dataframe via a 2D array.
You can avoid this by using the dictionary approach to create a dataframe:
my = pd.DataFrame(
    {'fruit': ['Apple', 'Kiwi', 'Clementine', 'Kiwi', 'Banana', 'Clementine', 'Apple', 'Kiwi'],
     'cheers': [1, 2, 3, 1, 2, 3, 1, 2]})
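The difference between the two ways of building the dataframe can be checked directly. The following is a small sketch with a shortened two-row version of the data; the variable names arr, from_array and from_dict are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixing str and int coerces every
# element to a string
arr = np.array([['Apple', 1], ['Kiwi', 2]])
from_array = pd.DataFrame(arr, columns=['fruit', 'cheers'])

# A dict of lists keeps a separate dtype for each column
from_dict = pd.DataFrame({'fruit': ['Apple', 'Kiwi'], 'cheers': [1, 2]})

print(arr.dtype.kind)                                       # 'U' means unicode string
print(pd.api.types.is_integer_dtype(from_array['cheers']))  # False
print(pd.api.types.is_integer_dtype(from_dict['cheers']))   # True
```

With the array-based construction, astype(int) is needed before any numeric aggregation such as sum(); with the dictionary-based construction it is not.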
Upvotes: 1