Joehat

Reputation: 1129

How to plot the index column in pandas/matplotlib?

I have the first column of my data frame as the meaningful index. I wish to plot that column as my x-axis. However, I am struggling to do so, as I keep receiving the error:

"None of [Float64Index([1992.9595, 1992.9866, 1993.0138, 1993.0409, 1993.0681, 1993.0952,\n 1993.1223, 1993.1495, 1993.1766, 1993.2038,\n ...\n 2002.7328, 2002.7599, 2002.7871, 2002.8142, 2002.8414, 2002.8685,\n 2002.8957, 2002.9228, 2002.95, 2002.9771],\n dtype='float64', name='Time', length=340)] are in the [columns]"

I've tried using x=df_topex.index as suggested in another forum question (linked below), but this does not seem to work for me. Could someone explain why, and how I can produce the plot?

df_topex = pd.read_csv('datasets/TOPEX.dat',
                       sep=r'\s+',  # one or more whitespace characters as separator
                       index_col=0,  # use the first column as the index
                       names=["Time", "Anomaly"],  # column names
                      )

df_topex.plot(kind='scatter', x=df_topex.index, y='Anomaly', color='red')
plt.show()

The other question: Use index in pandas to plot data

Upvotes: 2

Views: 8977

Answers (1)

Marcos

Reputation: 866

I have updated my answer based on your feedback so that it reproduces the issue more accurately.

With this:

df_topex = pd.read_csv('datasets/TOPEX.dat',
                       sep=r'\s+',  # one or more whitespace characters as separator
                       index_col=0,  # use the first column as the index
                       names=["Time", "Anomaly"],  # column names
                      )

You've got something like this, where the column "Time" is the index:

    Time    Anomaly
---------  ---------
1992.9595     2.0000
1992.9866     3.0000
1993.0138     4.0000
1993.0409     5.0000
1993.0681     6.0000
1993.0952     7.0000

To plot it, we can do the following, as you say. Just FYI, there is an open issue with this method (https://github.com/pandas-dev/pandas/issues/16529), but for now it is not a big deal:

df_topex.reset_index(inplace=True)
tabulate_df(df_topex)  # presumably a helper (not shown here) that prints the DataFrame as a table

A safer alternative is to assign the result instead of modifying the DataFrame in place:

df_topex = df_topex.reset_index()

Either way, "Time" is now a regular column ready to be used in a plot (note that "Time" does not appear to be in a time format):

            Time    Anomaly
------  ---------  ---------
     0  1992.9595     2.0000
     1  1992.9866     3.0000
     2  1993.0138     4.0000
     3  1993.0409     5.0000
     4  1993.0681     6.0000
     5  1993.0952     7.0000

To plot it:

df_topex.plot(kind='scatter', x='Time', y='Anomaly', color='red')
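For reference, here is a minimal self-contained sketch of the steps above. It assumes the six sample rows shown earlier stand in for datasets/TOPEX.dat (read from an in-memory string), and it selects the non-interactive "Agg" backend so that it runs without a display; both choices are assumptions for illustration. The key point is that the x argument of a scatter plot expects a column label, which is why passing the Index object itself fails.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import pandas as pd

# Stand-in for datasets/TOPEX.dat, built from the sample rows shown above.
raw = """1992.9595 2.0
1992.9866 3.0
1993.0138 4.0
1993.0409 5.0
1993.0681 6.0
1993.0952 7.0
"""

df_topex = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",                 # one or more whitespace characters as separator
    index_col=0,                # first column becomes the index
    names=["Time", "Anomaly"],  # column names
)

# Move the index back into a regular column so it can be referenced by label.
df_plot = df_topex.reset_index()

# x and y are column labels, not arrays or Index objects.
ax = df_plot.plot(kind="scatter", x="Time", y="Anomaly", color="red")
```

In an interactive session, calling plt.show() after importing matplotlib.pyplot as plt would display the figure.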

Now, following up on your last question: we have the plot, but we can no longer take advantage of having "Time" as the index, can we?

An index has a significant performance impact when filtering millions of rows. Maybe you want "Time" as the index because you have, or foresee, a high volume of data. Plotting millions of points can be done (with data shading, for example) but is not very common. Filtering a DataFrame before plotting it is quite common, and at that point an index on the column being filtered can really help; the plot normally comes after that.

So we can work in phases with different DataFrames, or do everything at once by running the following right after the CSV import. This keeps the index available for filtering and lets us plot against the Time2 column at any time:

df_topex['Time2'] = df_topex.index

So we keep "Time" as index:

    Time    Anomaly      Time2
---------  ---------  ---------
1992.9595     2.0000  1992.9595
1992.9866     3.0000  1992.9866
1993.0138     4.0000  1993.0138
1993.0409     5.0000  1993.0409
1993.0681     6.0000  1993.0681
1993.0952     7.0000  1993.0952
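A short sketch of plotting while "Time" stays as the index. Option 1 is the Time2 copy described above. Option 2 is not from the answer: it is an alternative that passes the index values straight to Matplotlib's Axes.scatter, which avoids the extra column. The inline sample data and the "Agg" backend are assumptions made so the example runs on its own.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for datasets/TOPEX.dat, built from the sample rows shown above.
raw = """1992.9595 2.0
1992.9866 3.0
1993.0138 4.0
1993.0409 5.0
1993.0681 6.0
1993.0952 7.0
"""
df_topex = pd.read_csv(
    io.StringIO(raw), sep=r"\s+", index_col=0, names=["Time", "Anomaly"]
)

# Option 1 (from the answer): copy the index into a regular column.
df_topex["Time2"] = df_topex.index
ax1 = df_topex.plot(kind="scatter", x="Time2", y="Anomaly", color="red")

# Option 2 (alternative, not from the answer): give Matplotlib the index values.
fig, ax2 = plt.subplots()
ax2.scatter(df_topex.index, df_topex["Anomaly"], color="red")
ax2.set_xlabel(df_topex.index.name)  # "Time"
ax2.set_ylabel("Anomaly")
```

In both cases df_topex.index is still "Time", so index-based filtering remains available.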

How do we take advantage of indexing? There is a nice post that measures the performance of filtering on the index: What is the performance impact of non-unique indexes in pandas?

In short, you want the index to be unique, or at least sorted.

# Index types, in order of preference for filtering performance:
# 1) unique
# 2) if not unique, at least sorted (monotonically increasing or decreasing)
# 3) worst combination: non-unique and unsorted

# Let's check:
print("Is unique?", df_topex.index.is_unique)
print("Is is_monotonic increasing?", df_topex.index.is_monotonic_increasing)
print("Is is_monotonic decreasing?", df_topex.index.is_monotonic_decreasing)

From the sample data:

Is unique? True
Is is_monotonic increasing? True
Is is_monotonic decreasing? False

If the index is not sorted, you can sort it with:

df_topex = df_topex.sort_index()
# Ready to go on filtering...
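As an illustration of the filtering step, here is a small sketch that sorts a deliberately shuffled copy of the sample rows and then selects a time window with a label slice on the index. The shuffled order and the window bounds (1993.0 to 1993.05) are hypothetical values chosen for the example. Label slices whose endpoints are not present in the index generally rely on the index being sorted, which is one practical reason to call sort_index() first.

```python
import pandas as pd

# Hypothetical, deliberately shuffled copy of the sample rows.
df = pd.DataFrame(
    {"Anomaly": [5.0, 2.0, 7.0, 3.0, 6.0, 4.0]},
    index=pd.Index(
        [1993.0409, 1992.9595, 1993.0952, 1992.9866, 1993.0681, 1993.0138],
        name="Time",
    ),
)

was_sorted = df.index.is_monotonic_increasing  # False for the shuffled data

# Sort by the index so that range lookups are well defined.
df = df.sort_index()

# Label-based slice on the float index: rows with 1993.0 <= Time <= 1993.05.
subset = df.loc[1993.0:1993.05]
print(subset)
```

The resulting subset can then be plotted after subset.reset_index(), or by passing subset.index to Matplotlib directly.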

Hope it helps.

Upvotes: 3
