Reputation: 91
I have the following snippet, and I would like to extend it so that the data from every loop iteration is plotted on the same canvas instead of on a separate one for each iteration.
for level in range(len(result)):
    sizes = result[level].values()
    distribution = pd.DataFrame(Counter(sizes).items(), columns=['community size', 'number of communities'])
    distribution.plot(kind='scatter', x='community size', y='number of communities')
Ideally, I would also like the dots in the scatter plot to be color-coded by their origin, so that all dots from the same loop iteration share one color.
I am fairly new to both matplotlib and pandas, so any help is highly appreciated.
Upvotes: 2
Views: 801
Reputation: 879113
Instead of calling plot many times, you could build the entire data set as one DataFrame and then call plot only once.
Starting with
result = [{0: 21, 1: 7, 2: 67, 3: 12, 4: 15, 5: 7, 6: 54, 7: 49, 8: 50, 9: 31,
           10: 6, 11: 2, 12: 8, 13: 2, 14: 2, 15: 1, 16: 35, 17: 2, 18: 1, 19:
           4, 20: 2, 21: 4, 22: 3, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1,
           29: 1},
          {0: 2, 1: 5, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3, 7: 2, 8: 1, 9: 1, 10: 1,
           11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1}]
you could build a DataFrame with columns level and size:
df = pd.DataFrame([(level, val) for level, dct in enumerate(result)
                   for val in dct.values()],
                  columns=['level', 'size'])
which looks like this:
    level  size
0       0    21
1       0     7
2       0    67
...
45      1     1
46      1     1
47      1     1
Now we can group by the level, and count how many items of each size
there are in each group:
size_count = df.groupby(['level'])['size'].apply(lambda x: x.value_counts())
# level
# 0      1     9
#        2     5
#        7     2
# ...
# 1      1    11
#        2     4
#        3     2
#        5     1
# dtype: int64
The groupby/apply above returns a pd.Series. To make this a DataFrame, we can turn the index levels into columns by calling reset_index(), and then assign names to the columns:
size_count = size_count.reset_index()
size_count.columns = ['level', 'community size', 'number of communities']
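As a possible alternative to the groupby/apply and reset_index() steps above, the same counts can be obtained by grouping on both columns at once and calling size(). This is only a sketch: the `result` list here is a shortened stand-in for the data above, and the rows come out sorted by size rather than by count, which makes no difference for a scatter plot.

```python
import pandas as pd

# Shortened stand-in for the `result` list above (one dict per level).
result = [{0: 21, 1: 7, 2: 7, 3: 2, 4: 2, 5: 1},
          {0: 2, 1: 5, 2: 2, 3: 1}]

# Long-form frame: one row per community, tagged with its level.
df = pd.DataFrame([(level, val) for level, dct in enumerate(result)
                   for val in dct.values()],
                  columns=['level', 'size'])

# Count the rows for every (level, size) pair, then turn the
# resulting MultiIndex back into ordinary columns.
size_count = (df.groupby(['level', 'size'])
                .size()
                .reset_index(name='number of communities')
                .rename(columns={'size': 'community size'}))

print(size_count)
```

Naming the count column through reset_index(name=...) avoids having to assign all three column names by position afterwards.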
Now the desired plot can be generated with
size_count.plot(kind='scatter', x='community size', y='number of communities',
                s=100, c='level')
s=100 controls the size of the dots, and c='level' tells plot to color the dots according to the value in the level column.
import pandas as pd
import matplotlib.pyplot as plt
result = [{0: 21, 1: 7, 2: 67, 3: 12, 4: 15, 5: 7, 6: 54, 7: 49, 8: 50, 9: 31,
           10: 6, 11: 2, 12: 8, 13: 2, 14: 2, 15: 1, 16: 35, 17: 2, 18: 1, 19:
           4, 20: 2, 21: 4, 22: 3, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1,
           29: 1},
          {0: 2, 1: 5, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3, 7: 2, 8: 1, 9: 1, 10: 1,
           11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1}]
df = pd.DataFrame([(level, val) for level, dct in enumerate(result)
                   for val in dct.values()],
                  columns=['level', 'size'])
size_count = df.groupby(['level'])['size'].apply(lambda x: x.value_counts())
size_count = size_count.reset_index()
size_count.columns = ['level', 'community size', 'number of communities']
cmap = plt.get_cmap('jet')
size_count.plot(kind='scatter', x='community size', y='number of communities',
                s=100, c='level', cmap=cmap)
plt.show()
Using a colorbar might be appropriate if there are dozens of levels.
On the other hand, if there are only a few levels, using a legend would make
more sense. In that case, it is more convenient to call plot once for each
level value, since the matplotlib code which makes the legend is set up to make
one legend entry per plot:
import pandas as pd
import matplotlib.pyplot as plt
result = [{0: 21, 1: 7, 2: 67, 3: 12, 4: 15, 5: 7, 6: 54, 7: 49, 8: 50, 9: 31,
           10: 6, 11: 2, 12: 8, 13: 2, 14: 2, 15: 1, 16: 35, 17: 2, 18: 1, 19:
           4, 20: 2, 21: 4, 22: 3, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1,
           29: 1},
          {0: 2, 1: 5, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3, 7: 2, 8: 1, 9: 1, 10: 1,
           11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1}]
df = pd.DataFrame([(level, val) for level, dct in enumerate(result)
                   for val in dct.values()],
                  columns=['level', 'size'])
groups = df.groupby('level')  # a plain string key keeps `level` a scalar (not a 1-tuple) when iterating
fig, ax = plt.subplots()
for level, grp in groups:
    size_count = grp['size'].value_counts()
    ax.plot(size_count.index, size_count, markersize=12, marker='o',
            linestyle='', label='level {}'.format(level))
ax.legend(loc='best', numpoints=1)
ax.set_xlabel('community size')
ax.set_ylabel('number of communities')
ax.grid(True)
plt.show()
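If you would rather keep the loop from the question, another option is to create one Axes up front and pass it to every DataFrame.plot call through the ax argument, so that all levels land on the same canvas. The following is a minimal sketch under a few assumptions: the helper name plot_levels is made up for illustration, the `result` list is a shortened stand-in for the real data, and the colors are simply matplotlib's default cycle ('C0', 'C1', ...).

```python
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd

# Shortened stand-in for the `result` list above (one dict per level).
result = [{0: 21, 1: 7, 2: 67, 3: 7, 4: 2, 5: 2, 6: 1},
          {0: 2, 1: 5, 2: 2, 3: 1, 4: 1}]


def plot_levels(result):
    """Hypothetical helper: draw every level's size distribution on one Axes."""
    fig, ax = plt.subplots()
    for level, communities in enumerate(result):
        counts = Counter(communities.values())
        distribution = pd.DataFrame(
            sorted(counts.items()),
            columns=['community size', 'number of communities'])
        # ax=ax reuses the same canvas; color and label differ per level.
        distribution.plot(kind='scatter', x='community size',
                          y='number of communities', s=100,
                          color='C{}'.format(level % 10),
                          label='level {}'.format(level), ax=ax)
    return ax


ax = plot_levels(result)
```

Calling plt.show() afterwards displays the figure. Because each level is a separate plot call, each one gets its own legend entry, just as in the ax.plot version above.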
Upvotes: 1