CDF Cumulative Distribution Function Error

Question

I am trying to plot a CDF for one column of a multi-column data file. When the file contains only one column, it plots fine. When I try to grab a particular column from the data, I get an error. I also tried using a for loop to read that column, and it reads fine. If I put the plot statements outside the for loop, the plot shows only the last value of the column; if I keep the plot statement inside the loop, it gives an error. The problem is not with reading the file or the particular column, nor with indentation. How do I fix it?

Code with for loop

import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator

with open('input.txt', 'r') as f:
    for rows in f:
        cols = rows.split()
        data = cols[2]
        sorted_data = np.sort(data)
        cdf = np.arange(len(data))/float(len(data))
        plt.plot(sorted_data, cdf, '-bs')

plt.show()
#print data

Error

Traceback (most recent call last):
  File "cdf_plot.py", line 13, in <module>
    plt.plot(sorted_data, cdf, '-bs')
  File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 2467, in plot
    ret = ax.plot(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 3893, in plot
    for line in self._get_lines(*args, **kwargs):
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 322, in _grab_next_args
    for seg in self._plot_args(remaining, kwargs):
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 300, in _plot_args
    x, y = self._xy_from_xy(x, y)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 240, in _xy_from_xy
    raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension

Code With no for loop:

import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator

data = np.loadtxt('input.txt')
data_one = [row[2] for row in data]
sorted_data = np.sort(data)
cdf = np.arange(len(data_one))/float(len(data_one))
#cumulative = np.cumsum(data)
#ccdf = 1 - cdf

#plt.plot(data, sorted_data, 'r-*')
plt.plot(sorted_data, cdf, '-bs')

#plt.xlim([0,0.5])
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.show()

Error:

Traceback (most recent call last):
  File "cum_graph.py", line 7, in <module>
    data = np.loadtxt('e_p_USC_30_days.txt')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 804, in loadtxt
    X = np.array(X, dtype)
ValueError: setting an array element with a sequence.

Input file: I am interested in calculating the CDF of col[2] i.e. column 3 only

4814  2464  27  0.000627707861971  117923.0
4211  736  2  4.64968786645  05  2576.0
2075  1339  30  0.000697453179968  499822.0
2441  2381  3  6.97453179968  05  1968.0
4694  1738  1  2.32484393323  05  5702.0
4406  3008  12  0.000278981271987  8483.0
3622  1396  3  6.97453179968  05  2564.0
5425  478  1  2.32484393323  05  428.0
4489  1715  6  0.000139490635994  19045.0
3695  3387  2  4.64968786645  05  16195.0

user1940040 · Accepted Answer

Multiple things are really wrong here.

1 - The problem with your data

Look at them carefully:

4814  2464  27  0.000627707861971  117923.0
4211  736  2  4.64968786645  05  2576.0
2075  1339  30  0.000697453179968  499822.0
2441  2381  3  6.97453179968  05  1968.0
4694  1738  1  2.32484393323  05  5702.0
4406  3008  12  0.000278981271987  8483.0
3622  1396  3  6.97453179968  05  2564.0
5425  478  1  2.32484393323  05  428.0
4489  1715  6  0.000139490635994  19045.0
3695  3387  2  4.64968786645  05  16195.0

Sometimes you have 6 columns, as in:

4211  736  2  4.64968786645  05  2576.0

and sometimes you have only 5:

4814  2464  27  0.000627707861971  117923.0

So the first thing is to learn how to write data correctly.

2 - Write the data correctly

Imagine that all your data is in a 2D numpy array called data.

You could call:

numpy.savetxt("input.txt", data)

or, to get more control over formatting:

numpy.savetxt("input.txt", data, fmt="%d %d %d %.6f %d %.1f")

The fmt= parameter is a way to tell numpy how you want to save your data (%d means write it as an integer, %f means write it as a float, %.5f means write it as a float with only 5 decimals).
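To see what each specifier does, here is a small sketch using Python's own % operator, which follows the same conventions as fmt=. The sample row is made up for illustration and is not taken from your file.

```python
# Hypothetical sample row with six values (not from the input file).
row = (4814, 2464, 27, 0.000627707861971, 5, 117923.0)

# Same format string as above: three integers, a float with 6 decimals,
# an integer, and a float with 1 decimal.
fmt = "%d %d %d %.6f %d %.1f"
line = fmt % row

print(line)  # 4814 2464 27 0.000628 5 117923.0
```

Every line written this way has exactly six whitespace-separated fields, which is what makes the file easy to load back.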

If you want to write it yourself, you could do something like:

fmt = "%d %d %d %.6f %d %.1f"
with open("input.txt", "w") as f:
    for row in data:
        # % formatting needs a tuple, not a numpy array
        f.write(fmt % tuple(row) + "\n")

If the lines with 5 columns instead of 6 are what you really want to write, then use another delimiter, such as a comma. This way,

4814,2464,27,0.000627707861971,,117923.0

obviously contains 6 columns.
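A quick way to convince yourself, sketched with plain str.split. The whitespace-separated line is one of yours; the comma-separated one is the example just above.

```python
space_line = "4814  2464  27  0.000627707861971  117923.0"
comma_line = "4814,2464,27,0.000627707861971,,117923.0"

# Splitting on whitespace collapses the gap: the missing field is invisible.
space_fields = space_line.split()

# Splitting on commas keeps the missing field as an empty string.
comma_fields = comma_line.split(",")

print(len(space_fields))      # 5
print(len(comma_fields))      # 6
print(comma_fields[4] == "")  # True
```

With an explicit delimiter, a reader that tolerates missing values (numpy.genfromtxt, for instance) can tell which field is absent instead of guessing.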

3 - Loading valid data

What I call valid data is consistent data, data which always contains the same number of columns.

You should really use numpy.loadtxt or numpy.genfromtxt (the latter is the one to use if data are missing). Note that you can specify a delimiter for both of them using the delimiter argument.

data = numpy.loadtxt("valid_input.txt")
col = data[:,2]

Or, equivalently, you could use the usecols argument together with the unpack argument.
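A minimal sketch of the usecols and unpack variant. It assumes a consistent file; the three rows below are a made-up stand-in for valid_input.txt, fed through io.StringIO so the example runs on its own.

```python
import io
import numpy as np

# Hypothetical, consistent stand-in for "valid_input.txt": 5 columns on every row.
valid_text = """\
4814 2464 27 0.000627 117923.0
2075 1339 30 0.000697 499822.0
4406 3008 12 0.000278 8483.0
"""

# Load only the third column (index 2).
col = np.loadtxt(io.StringIO(valid_text), usecols=(2,))

# With several columns, unpack=True hands back one array per column.
first, third = np.loadtxt(io.StringIO(valid_text), usecols=(0, 2), unpack=True)

print(col)    # the three values of column 2
print(first)  # the three values of column 0
```

With a real file you would pass the file name instead of the StringIO object.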

4 - Loading invalid data

For your data, the usecols method works if you select only the third column (column 2 in Python's zero-based indexing), provided nothing is wrong before column 2 on any other row.

You could also do it by hand, which brings us to another problem:

5 - The problems with your first implementation

Here, each iteration overwrites the variable data with a single value (the one in cols[2]):

with open('input.txt', 'r') as f:
    for rows in f:
        cols = rows.split()
        data = cols[2]

There you try to sort a single value:

        sorted_data = np.sort(data)

Here you take the length of a single value (cols[2] is a string, so len(data) counts its characters, not your data points):

        cdf = np.arange(len(data))/float(len(data))
        plt.plot(sorted_data, cdf, '-bs')

plt.show()

I'm really surprised numpy does not complain.

You are getting one row at a time: you need to store these values somewhere (in a list, for instance) and only then sort them and compute the CDF.
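A rough sketch of that idea, not a finished solution: the first four lines of your file are inlined here so it runs without input.txt, and the variable name values is my own choice.

```python
import numpy as np

# The first four lines of the input file, inlined so the sketch is self-contained.
lines = [
    "4814  2464  27  0.000627707861971  117923.0",
    "4211  736  2  4.64968786645  05  2576.0",
    "2075  1339  30  0.000697453179968  499822.0",
    "2441  2381  3  6.97453179968  05  1968.0",
]

values = []
for line in lines:
    cols = line.split()
    # Convert the string to a number and keep it instead of overwriting it.
    values.append(float(cols[2]))

# Only after the loop: one array holding every value of the column.
data = np.array(values)
sorted_data = np.sort(data)
cdf = np.arange(len(data)) / float(len(data))

print(sorted_data)  # [ 2.  3. 27. 30.]
print(cdf)          # [0.   0.25 0.5  0.75]
```

Both arrays now have one entry per row, so they can be handed to plt.plot together.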

6 - The problem with your second implementation

numpy.loadtxt can't load your data (it tries to load everything by default) because it can't infer what you want to do with 6 columns or 5 columns depending on the row. So the only thing it can do is fail.
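You can check the inconsistency yourself before calling loadtxt. A small sketch, with your ten lines inlined so it does not depend on the file:

```python
from collections import Counter

# The ten lines of the input file from the question.
lines = [
    "4814  2464  27  0.000627707861971  117923.0",
    "4211  736  2  4.64968786645  05  2576.0",
    "2075  1339  30  0.000697453179968  499822.0",
    "2441  2381  3  6.97453179968  05  1968.0",
    "4694  1738  1  2.32484393323  05  5702.0",
    "4406  3008  12  0.000278981271987  8483.0",
    "3622  1396  3  6.97453179968  05  2564.0",
    "5425  478  1  2.32484393323  05  428.0",
    "4489  1715  6  0.000139490635994  19045.0",
    "3695  3387  2  4.64968786645  05  16195.0",
]

# How many rows have how many whitespace-separated fields?
field_counts = Counter(len(line.split()) for line in lines)

print(dict(field_counts))  # {5: 4, 6: 6}
```

More than one key in that dictionary means the rows cannot be stacked into a rectangular array as they are.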

7 - The problem with you

First, don't get offended: what I'm going to say is meant to help you improve. I'm not judging you in any way, just showing you how you should react when faced with this kind of error, trivial or not.

Read the errors.
Try to understand what's happening.
Look for those errors on the internet.
Ask someone.

The problem is that you seem to have just copy-pasted the errors without actually looking at them, and so without trying to understand them (but I may be wrong, I'm not in your head :)).

But what's for sure is that you have not copy-pasted them into your favorite search engine, because answers are plentiful. Again, I may be wrong. Maybe you did, but without seeing how those answers could apply to your case. Still, the first Google result for

ValueError: x and y must have same first dimension

is pretty explicit. You don't even have to mention this is matplotlib or Python. Then you would have discovered that sorted_data is not the same length as cdf. With a little more work, you would have figured out what I said before about your implementations.
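This is the kind of check I mean, sketched for one iteration of your loop. I use np.asarray rather than np.sort here because how np.sort treats a lone string may depend on the NumPy version, which is an assumption on my part.

```python
import numpy as np

data = "27"  # what cols[2] holds on the first row: a string, not an array

# A single value becomes a zero-dimensional array: it has no length at all.
x = np.asarray(data)

# len("27") is 2 (two characters), so cdf gets two entries.
cdf = np.arange(len(data)) / float(len(data))

print(x.shape)    # ()
print(cdf.shape)  # (2,)
```

Printing the shapes of both arguments right before the plt.plot call is usually enough to spot this kind of mismatch.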

8 - Prove me wrong

As you've seen, I've not given a "canonical answer" and I won't, since I consider that you have not done your part of the job. But you can still do it: I've given you all the tools you need to answer your own question. That doesn't mean that you have to do it all alone on a remote island: I've almost given a complete answer (really), the doc can help and Google too :). All you have to do is search a tiny bit for it. Once you have something working, edit your question (or answer your own question).