con

Reputation: 6093

Cannot make violin plot with different length sub-lists

I'm trying to make a violin plot with Python 3.8.10 and matplotlib 3.3.4:

import matplotlib.pyplot as plt
import numpy as np
data = []
data.append([65,46,64,59,42,44])
data.append([20,40,44,43,32,20,27,31,20,40,24,26,37,30,29,25,31,65,50,38,41,19,31,38,48,44,51,55,52,25,40,28,50,37,44,21,43,28,36,67,55,58,23,36,28,21,21,39,26,65,18,27,50,70,29,37,25,49,33,31,20,33])

f = plt.figure()
plt.rc('xtick', labelsize = 6)
violin_plot = plt.violinplot(data, showmeans=False, showmedians=False)
for pc in violin_plot["bodies"]:
    pc.set_edgecolor('black')
def adjacent_values(vals, q1, q3):
    upper_adjacent_value = q3 + (q3 - q1) * 1.5
    upper_adjacent_value = np.clip(upper_adjacent_value, q3, vals[-1])
    lower_adjacent_value = q1 - (q3 - q1) * 1.5
    lower_adjacent_value = np.clip(lower_adjacent_value, vals[0], q1)
    return lower_adjacent_value, upper_adjacent_value
quartile1, medians, quartile3 = np.percentile(data, [25, 50, 75], axis=1)
whiskers = np.array([
    adjacent_values(sorted_array, q1, q3)
    for sorted_array, q1, q3 in zip(data, quartile1, quartile3)])
whiskers_min, whiskers_max = whiskers[:, 0], whiskers[:, 1]
inds = np.arange(1, len(medians) + 1)
plt.scatter(inds, medians, marker="o", color="white", s=30, zorder=3)
plt.vlines(inds, quartile1, quartile3, color="k", linestyle="-", lw=5)
plt.vlines(inds, whiskers_min, whiskers_max, color="k", linestyle="-", lw=1)
plt.savefig('violin_age_by_race.svg', bbox_inches='tight', pad_inches = 0.05)

I adapted this from https://matplotlib.org/devdocs/gallery/statistics/customized_violin.html

The code above raises an error (the line numbers in the traceback differ from the code above because I trimmed the file down to a minimal example for StackOverflow):

/usr/local/lib/python3.8/dist-packages/numpy/core/_asarray.py:171: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
  File "/tmp/E2Woujgas1.py", line 35, in <module>
    quartile1, medians, quartile3 = np.percentile(data, [25, 50, 75], axis=1)
  File "<__array_function__ internals>", line 5, in percentile
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 3818, in percentile
    return _quantile_unchecked(
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 3937, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 3495, in _ureduce
    axis = _nx.normalize_axis_tuple(axis, nd)
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py", line 1391, in normalize_axis_tuple
    axis = tuple([normalize_axis_index(ax, ndim, argname) for ax in axis])
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py", line 1391, in <listcomp>
    axis = tuple([normalize_axis_index(ax, ndim, argname) for ax in axis])
numpy.AxisError: axis 1 is out of bounds for array of dimension 1

The error is raised by the line quartile1, medians, quartile3 = np.percentile(data, [25, 50, 75], axis=1), so I did what the warning message suggests and changed it to

quartile1, medians, quartile3 = np.percentile(data, [25, 50, 75], axis=1, dtype = object)

but then I get an error:

TypeError: _percentile_dispatcher() got an unexpected keyword argument 'dtype'

As far as I can tell, the error is thrown because the sub-lists have different lengths, which is unavoidable for my data. The example had sub-lists of 100 elements each.

I've also tried making an np array:

np_data = np.array(data, dtype = object)
quartile1, medians, quartile3 = np.percentile(np_data, [25, 50, 75], axis=1, dtype = object)

but this gives the same error about dtype.

How can I alter this code so that numpy won't complain about different length sub-lists?

Upvotes: 0

Views: 382

Answers (1)

hpaulj

Reputation: 231385

The error isn't in the violinplot! That works just fine.

It's in the percentile function.

In [23]: np.percentile(data, [25, 50, 75], axis=1)
/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py:3539: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  a = np.asanyarray(a)
Traceback (most recent call last):
  File "<ipython-input-23-32c56e5bfa18>", line 1, in <module>
    np.percentile(data, [25, 50, 75], axis=1)
  File "<__array_function__ internals>", line 5, in percentile
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 3867, in percentile
    return _quantile_unchecked(
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 3986, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 3544, in _ureduce
    axis = _nx.normalize_axis_tuple(axis, nd)
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py", line 1385, in normalize_axis_tuple
    axis = tuple([normalize_axis_index(ax, ndim, argname) for ax in axis])
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py", line 1385, in <listcomp>
    axis = tuple([normalize_axis_index(ax, ndim, argname) for ax in axis])
AxisError: axis 1 is out of bounds for array of dimension 1

data is a list. percentile needs an array, so it first tries to convert data to one (the np.asanyarray call in the warning above):

In [25]: type(data)
Out[25]: list
In [26]: np.array(data)
<ipython-input-26-d04fee483c4a>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  np.array(data)
Out[26]: 
array([list([65, 46, 64, 59, 42, 44]),
       list([20, 40, 44, 43, 32, 20, 27, 31, 20, 40, 24, 26, 37, 30, 29, 25, 31, 65, 50, 38, 41, 19, 31, 38, 48, 44, 51, 55, 52, 25, 40, 28, 50, 37, 44, 21, 43, 28, 36, 67, 55, 58, 23, 36, 28, 21, 21, 39, 26, 65, 18, 27, 50, 70, 29, 37, 25, 49, 33, 31, 20, 33])],
      dtype=object)

So you can make an array from data without the warning:

In [30]: np_data=np.array(data, dtype=object)
In [31]: np_data
Out[31]: 
array([list([65, 46, 64, 59, 42, 44]),
       list([20, 40, 44, 43, 32, 20, 27, 31, 20, 40, 24, 26, 37, 30, 29, 25, 31, 65, 50, 38, 41, 19, 31, 38, 48, 44, 51, 55, 52, 25, 40, 28, 50, 37, 44, 21, 43, 28, 36, 67, 55, 58, 23, 36, 28, 21, 21, 39, 26, 65, 18, 27, 50, 70, 29, 37, 25, 49, 33, 31, 20, 33])],
      dtype=object)

But note that it is 1-D: an array of two lists. Specifying axis=1 is wrong because the array has no such axis.
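To check this yourself, here is a small sketch that prints the shape of the object array. The second list is a shortened stand-in of my own, not the full data from the question; any two lists of different lengths should behave the same way.

```python
import numpy as np

# Two sub-lists of different lengths (second one shortened for illustration).
data = [[65, 46, 64, 59, 42, 44], [20, 40, 44, 43, 32]]

# dtype=object avoids the ragged-sequence warning, but the result
# is a 1-D array whose two elements are Python lists.
np_data = np.array(data, dtype=object)

print(np_data.shape)      # (2,)
print(np_data.ndim)       # 1
print(type(np_data[0]))   # <class 'list'>
```

With only one dimension, axis=0 is the only valid axis, which is why axis=1 raises AxisError.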

Even without axis, calling percentile on that array of lists doesn't work:

In [32]: np.percentile(np_data, [25, 50, 75])
Traceback (most recent call last):
  File "<ipython-input-32-31dd33e64b74>", line 1, in <module>
    np.percentile(np_data, [25, 50, 75])
  File "<__array_function__ internals>", line 5, in percentile
 ....
 packages/numpy/lib/function_base.py", line 4009, in _lerp
    diff_b_a = subtract(b, a)
TypeError: unsupported operand type(s) for -: 'list' and 'list'

You can compute the percentiles for the two lists separately:

In [34]: np.percentile(np_data[0], [25, 50, 75])
Out[34]: array([44.5 , 52.5 , 62.75])
In [35]: np.percentile(np_data[1], [25, 50, 75])
Out[35]: array([26.25, 34.5 , 44.  ])
In [36]: np.percentile(data[1], [25, 50, 75])
Out[36]: array([26.25, 34.5 , 44.  ])
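Putting that together, here is a minimal sketch of how the rest of the question's code could be adapted, assuming you want the quartiles and whiskers computed per sub-list. The list comprehension and the np.sort step are my additions, not something in the question's code. The sort is there because adjacent_values reads vals[0] and vals[-1] as the minimum and maximum, which only holds for sorted input.

```python
import numpy as np

data = [
    [65, 46, 64, 59, 42, 44],
    [20, 40, 44, 43, 32, 20, 27, 31, 20, 40, 24, 26, 37, 30, 29, 25, 31,
     65, 50, 38, 41, 19, 31, 38, 48, 44, 51, 55, 52, 25, 40, 28, 50, 37,
     44, 21, 43, 28, 36, 67, 55, 58, 23, 36, 28, 21, 21, 39, 26, 65, 18,
     27, 50, 70, 29, 37, 25, 49, 33, 31, 20, 33],
]


def adjacent_values(sorted_vals, q1, q3):
    # Whisker ends: 1.5 * IQR beyond the quartiles, clipped to the data range.
    iqr = q3 - q1
    upper = np.clip(q3 + 1.5 * iqr, q3, sorted_vals[-1])
    lower = np.clip(q1 - 1.5 * iqr, sorted_vals[0], q1)
    return lower, upper


# Sort each sub-list on its own; the lengths may differ.
sorted_data = [np.sort(group) for group in data]

# One percentile call per sub-list gives a regular (n_groups, 3) array.
stats = np.array([np.percentile(group, [25, 50, 75]) for group in sorted_data])
quartile1, medians, quartile3 = stats.T

whiskers = np.array([
    adjacent_values(group, q1, q3)
    for group, q1, q3 in zip(sorted_data, quartile1, quartile3)
])
whiskers_min, whiskers_max = whiskers[:, 0], whiskers[:, 1]
inds = np.arange(1, len(medians) + 1)
```

The resulting quartile1, medians, quartile3, whiskers_min, whiskers_max and inds are ordinary 1-D arrays with one entry per sub-list, so they can be passed to the plt.scatter and plt.vlines calls from the question unchanged.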

Upvotes: 1
