
Reputation: 4577

Pandas: plot multiple columns to same x value

Followup to a previous question regarding data analysis with pandas. I now want to plot my data, which looks like this:

PrEST ID    Gene    Sequence        Ratio1    Ratio2    Ratio3
HPRR12  ATF1    TTPSAXXXXXXXXXTTTK  6.3222    4.0558    4.958   
HPRR23  CREB1   KIXXXXXXXXPGVPR     NaN       NaN       NaN     
HPRR23  CREB1   ILNXXXXXXXXGVPR     0.22691   2.077     NaN
HPRR15  ELK4    IEGDCEXXXXXXXGGK    1.177     NaN       12.073  
HPRR15  ELK4    SPXXXXXXXXXXXSVIK   8.66      14.755    NaN
HPRR15  ELK4    IEGDCXXXXXXXVSSSSK  15.745    7.9122    9.5966

... except there are a bunch more rows, and I don't actually want to plot the ratios but some other calculated values derived from them, but it doesn't matter for my plotting problem. I have a dataframe that looks more or less like that data above, and what I want is this:

Each row (3 ratios) should be plotted against the row's ID, as points
All rows with the same ID should be plotted to the same x value / ID, but with another colour
The x ticks should be the IDs, and (if possible) the corresponding gene as well (so some genes will appear on several x ticks, as they have multiple IDs mapping to them)

Below is an image that my previous, non-pandas version of this script produces:

[Image: plot produced by the previous, non-pandas version of the script]

... where the red triangles indicate values outside of a cutoff value used for setting the y-axis maximum value. The IDs are blacked-out, but you should be able to see what I'm after. Copy number is essentially the ratios with a calculation on top of them, so they're just another number rather than the ones I show in the data above.

I have tried to find similar questions and solutions in the documentation, but found none. Most people seem to need to do this with dates, for which there seem to be ready-made plotting functions, which doesn't help me (I think). Any help greatly appreciated!

Upvotes: 8

Answers (3)

szeitlin

Reputation: 3341

I have had similar problems. I think the issue you're having with mismatched labels & markers is because of how you're iterating through the data.

Suggestions for getting pandas to work:

As other people mentioned, I always start by double-checking data types. Make sure you don't have any rows with strange things in them (NaNs, symbols, or other missing values will often cause this type of error with plotting packages).

Drop NAs if you haven't already, then explicitly convert whole columns to the appropriate dtype as needed.

In pandas, an 'object' is not the same as a 'string', and some of the plotting packages don't like 'objects' (see below).

I have also run into strange problems sometimes if my index wasn't continuous (if you drop NAs, you may have to reindex), or if my x-axis values weren't pre-sorted.
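A minimal sketch of that cleanup, assuming a toy frame in which the ratio column was read in as strings (the column names follow the question; the `clean` name is just for illustration):

```python
import pandas as pd

# Toy data: Ratio1 is stored as strings with one missing value.
df = pd.DataFrame({
    "PrEST ID": ["HPRR12", "HPRR23", "HPRR23", "HPRR15"],
    "Ratio1": ["6.3222", None, "0.22691", "1.177"],
})

# Inspect the dtypes first; text columns will not show up as float64.
print(df.dtypes)

clean = (
    df.dropna(subset=["Ratio1"])       # drop rows with a missing ratio
      .astype({"Ratio1": float})       # explicit conversion to a numeric dtype
      .sort_values("PrEST ID")         # pre-sort the x-axis values
      .reset_index(drop=True)          # make the index continuous again
)
print(clean)
```

Whether you want to drop rows or keep the NaNs depends on your data; matplotlib simply skips NaN points, so dropping is not always needed.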

(Note that matplotlib prefers numbers, but other plotting packages can handle categorical data in ways that will make your life a lot easier.)

Lately I am using seaborn, which doesn't seem to have the same kinds of problems with 'objects'. Specifically, you might want to take a look at seaborn's factorplot (renamed catplot in seaborn 0.9). Seaborn also has easy options for color palettes, so that might solve more than one of these issues for you.
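As a rough sketch of the seaborn route: the data is first reshaped to long format with `melt`, then drawn with `stripplot`, the axes-level function for categorical scatter plots. Using `stripplot` here is my assumption about what fits the question, and the values are copied from the question's table.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

nan = float("nan")
df = pd.DataFrame({
    "PrEST ID": ["HPRR12", "HPRR23", "HPRR23", "HPRR15", "HPRR15", "HPRR15"],
    "Gene": ["ATF1", "CREB1", "CREB1", "ELK4", "ELK4", "ELK4"],
    "Sequence": ["TTPSAXXXXXXXXXTTTK", "KIXXXXXXXXPGVPR", "ILNXXXXXXXXGVPR",
                 "IEGDCEXXXXXXXGGK", "SPXXXXXXXXXXXSVIK", "IEGDCXXXXXXXVSSSSK"],
    "Ratio1": [6.3222, nan, 0.22691, 1.177, 8.66, 15.745],
    "Ratio2": [4.0558, nan, 2.077, nan, 14.755, 7.9122],
    "Ratio3": [4.958, nan, nan, 12.073, nan, 9.5966],
})

# Long format: one row per (sequence, ratio) measurement, NaNs removed.
long_df = df.melt(
    id_vars=["PrEST ID", "Gene", "Sequence"],
    value_vars=["Ratio1", "Ratio2", "Ratio3"],
    var_name="ratio",
    value_name="value",
).dropna(subset=["value"])

fig, ax = plt.subplots()
# x is categorical (the ID), hue gives each row/sequence its own colour.
sns.stripplot(data=long_df, x="PrEST ID", y="value", hue="Sequence",
              jitter=False, ax=ax)
```

With `jitter=False` all points for one ID sit on the same x value, and the string IDs are used as tick labels without any manual conversion.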

Some pandas tricks you might want to try, if you haven't already:

converting your code objects explicitly to strings:

df['code_as_word'] = df['secretcodenumber'].astype(str)

Or drop the letters, as you suggested, and convert objects to numeric instead (in current pandas, via pd.to_numeric):

df['secretcodenumber'] = pd.to_numeric(df['secretcodenumber'], errors='coerce')
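Both tricks on a toy frame might look like the sketch below. The column names `id_as_str` and `id_num` are made up for illustration, and the "n/a" entry is an invented stand-in for a value that cannot be parsed.

```python
import pandas as pd

df = pd.DataFrame({
    "PrEST ID": ["HPRR12", "HPRR23", "HPRR15"],
    "Ratio1": ["6.3222", "n/a", "1.177"],
})

# Explicit conversion to strings.
df["id_as_str"] = df["PrEST ID"].astype(str)

# Drop the letters: keep only the digits of the ID, then convert to numbers.
df["id_num"] = pd.to_numeric(df["PrEST ID"].str.extract(r"(\d+)", expand=False))

# Convert a text column to numeric; anything unparseable becomes NaN.
df["Ratio1"] = pd.to_numeric(df["Ratio1"], errors="coerce")

print(df)
```

`errors="coerce"` silently turns bad values into NaN, so it is worth checking how many NaNs appear afterwards.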

Upvotes: 0

Noah Hafner

Reputation: 351

Skipping some of the finer points of plotting, to get:

Each row (3 ratios) should be plotted against the row's ID, as points
All rows with the same ID should be plotted to the same x value / ID, but with another colour
The x ticks should be the IDs, and (if possible) the corresponding gene as well (so some genes will appear on several x ticks, as they have multiple IDs mapping to them)

I suggest you try using matplotlib to handle the plotting, and manually cycle the colors. You can use something like:

import matplotlib.pyplot as plt
import pandas as pd
import itertools
#data
df = pd.DataFrame(
    {'id': [1, 2, 3, 3],
     'labels': ['HPRR1234', 'HPRR4321', 'HPRR2345', 'HPRR2345'],
     'g': ['KRAS', 'KRAS', 'ELK4', 'ELK4'],
     'r1': [15, 9, 15, 1],
     'r2': [14, 8, 7, 0],
     'r3': [14, 16, 9, 12]})
#extra setup
plt.rcParams['xtick.major.pad'] = 8
#plotting style(s)
marker = itertools.cycle((',', '+', '.', 'o', '*'))
color = itertools.cycle(('b', 'g', 'r', 'c', 'm', 'y', 'k'))
#plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(df['id'], df['r1'], ls='', ms=10, mew=2,
        marker=next(marker), color=next(color))
ax.plot(df['id'], df['r2'], ls='', ms=10, mew=2,
        marker=next(marker), color=next(color))
ax.plot(df['id'], df['r3'], ls='', ms=10, mew=2,
        marker=next(marker), color=next(color))
# set the tick labels
ax.xaxis.set_ticks(df['id'])
ax.xaxis.set_ticklabels(df['labels'])
plt.setp(ax.get_xticklabels(), rotation='vertical', fontsize=12)
plt.tight_layout()
fig.savefig("example.pdf")

If you have many rows, you will probably want more colors, but this shows at least the concept.
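For more colors, one option is to cycle through a qualitative colormap instead of a hand-written tuple. A small sketch, assuming matplotlib's built-in 'tab20' colormap is acceptable for your figure (`n_series` is a made-up count):

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# 'tab20' is a listed colormap holding 20 distinct colours.
cmap = plt.get_cmap("tab20")
color = itertools.cycle(cmap.colors)
marker = itertools.cycle((",", "+", ".", "o", "*"))

n_series = 12  # hypothetical number of series to draw

fig, ax = plt.subplots()
for i in range(n_series):
    # One dummy point per series, just to show the colour/marker cycling.
    ax.plot([i], [i], ls="", ms=10, mew=2,
            marker=next(marker), color=next(color))
```

Since the marker cycle has 5 entries and the colormap has 20, the marker/color combinations only start repeating after 20 series.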

Upvotes: 6

erikfas

Reputation: 4577

I managed to find a way to keep the string names! I thought about what you said about finding numbers for the IDs and figured I could use the index, which worked just fine.

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(df.index, df['r1'], ls='', marker=next(marker), color=next(color))
ax.plot(df.index, df['r2'], ls='', marker=next(marker), color=next(color))
ax.plot(df.index, df['r3'], ls='', marker=next(marker), color=next(color))

ax.xaxis.set_ticks(df.index)
ax.xaxis.set_ticklabels(df['g'])

Now I've got some other problems, though. I did not realise it until now, but while plotting as above DOES work, it's not exactly in the way I wanted it. Doing it like this will give me three values per ID x tick, and then the plotting continues beyond the x-axis limits, with three more values per tick (although there are not more ticks). It looks like this:

[Image: weird plot beyond x ticks]

What is wrong here, and why won't all the values map to the correct ID?
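For reference, here is a sketch of one way to put every row that shares an ID on the same x position. It assumes the default index has one entry per row rather than per ID, which I cannot confirm is the cause of the plot above. `pd.factorize` gives each unique ID an integer position; the data is copied from the question and the variable names are made up.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "PrEST ID": ["HPRR12", "HPRR23", "HPRR23", "HPRR15", "HPRR15", "HPRR15"],
    "Gene": ["ATF1", "CREB1", "CREB1", "ELK4", "ELK4", "ELK4"],
    "Ratio1": [6.3222, nan, 0.22691, 1.177, 8.66, 15.745],
    "Ratio2": [4.0558, nan, 2.077, nan, 14.755, 7.9122],
    "Ratio3": [4.958, nan, nan, 12.073, nan, 9.5966],
})
ratio_cols = ["Ratio1", "Ratio2", "Ratio3"]

# One integer x position per unique ID, in order of first appearance.
x_pos, unique_ids = pd.factorize(df["PrEST ID"])

# Gene for each unique ID, used to build two-line tick labels.
genes = df.drop_duplicates("PrEST ID").set_index("PrEST ID")["Gene"]
tick_labels = [f"{pid}\n{genes[pid]}" for pid in unique_ids]

colors = itertools.cycle(("b", "g", "r", "c", "m", "y", "k"))

fig, ax = plt.subplots()
# Each row gets its own colour; its three ratios share the row's x position.
for x, (_, row) in zip(x_pos, df.iterrows()):
    ax.plot([x] * len(ratio_cols), row[ratio_cols].astype(float),
            ls="", marker="o", color=next(colors))

# Exactly one tick per unique ID, labelled with ID and gene.
ax.set_xticks(list(range(len(unique_ids))))
ax.set_xticklabels(tick_labels)
ax.set_xlim(-0.5, len(unique_ids) - 0.5)
fig.tight_layout()
```

The number of ticks then equals the number of unique IDs, so no points land beyond the last tick. Saving works as in the answer above, with fig.savefig.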

Upvotes: 0
