I am trying to plot a CDF for one column of a multi-column data file. When the data file contains only one column, it plots fine. When I try to grab a particular column from the data, I get an error. I also tried using a for loop to read a particular column, and the column is read fine. If I put the plot statements outside the for loop, the plot shows only the last value of the column, and if I keep the plot statements inside the loop, I get an error. It is not a problem with reading the file or the particular column, and it is not an indentation problem either. How do I fix it?
Code with for loop
import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator
with open('input.txt', 'r') as f:
    for rows in f:
        cols = rows.split()
        data = cols[2]
        sorted_data = np.sort(data)
        cdf = np.arange(len(data))/float(len(data))
        plt.plot(sorted_data, cdf, '-bs')
        plt.show()
        #print data
Error
Traceback (most recent call last):
File "cdf_plot.py", line 13, in <module>
plt.plot(sorted_data, cdf, '-bs')
File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 2467, in plot
ret = ax.plot(*args, **kwargs)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 3893, in plot
for line in self._get_lines(*args, **kwargs):
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 322, in _grab_next_args
for seg in self._plot_args(remaining, kwargs):
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 300, in _plot_args
x, y = self._xy_from_xy(x, y)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 240, in _xy_from_xy
raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension
Code With no for loop:
import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator
data = np.loadtxt('input.txt')
data_one = [row[2] for row in data]
sorted_data = np.sort(data)
cdf = np.arange(len(data_one))/float(len(data_one))
#cumulative = np.cumsum(data)
#ccdf = 1 - cdf
#plt.plot(data, sorted_data, 'r-*')
plt.plot(sorted_data, cdf, '-bs')
#plt.xlim([0,0.5])
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.show()
Error:
Traceback (most recent call last):
File "cum_graph.py", line 7, in <module>
data = np.loadtxt('e_p_USC_30_days.txt')
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 804, in loadtxt
X = np.array(X, dtype)
ValueError: setting an array element with a sequence.
Input file: I am interested in calculating the CDF of col[2] i.e. column 3 only
4814 2464 27 0.000627707861971 117923.0
4211 736 2 4.64968786645 05 2576.0
2075 1339 30 0.000697453179968 499822.0
2441 2381 3 6.97453179968 05 1968.0
4694 1738 1 2.32484393323 05 5702.0
4406 3008 12 0.000278981271987 8483.0
3622 1396 3 6.97453179968 05 2564.0
5425 478 1 2.32484393323 05 428.0
4489 1715 6 0.000139490635994 19045.0
3695 3387 2 4.64968786645 05 16195.0
Upvotes: 2
Views: 1425
Several things are wrong here.
First, your data are not consistent. Look at them carefully:
4814 2464 27 0.000627707861971 117923.0
4211 736 2 4.64968786645 05 2576.0
2075 1339 30 0.000697453179968 499822.0
2441 2381 3 6.97453179968 05 1968.0
4694 1738 1 2.32484393323 05 5702.0
4406 3008 12 0.000278981271987 8483.0
3622 1396 3 6.97453179968 05 2564.0
5425 478 1 2.32484393323 05 428.0
4489 1715 6 0.000139490635994 19045.0
3695 3387 2 4.64968786645 05 16195.0
Sometimes you have 6 columns, as in:
4211 736 2 4.64968786645 05 2576.0
and sometimes you only have 5:
4814 2464 27 0.000627707861971 117923.0
So the first thing is to learn how to write data consistently.
Imagine that all your data are in a 2D numpy array called data. You could call:
numpy.savetxt("input.txt", data)
or, to get more control over formatting:
numpy.savetxt("input.txt", data, fmt="%d %d %d %.6f %d %.1f")
The fmt= parameter is a way to tell numpy how you want to save your data (%d means write it as an integer, %f means write it as a float, %.5f means write it as a float with only 5 decimals).
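As a rough illustration of the fmt= parameter, here is a minimal sketch that writes a small array to an in-memory buffer and reads it back. The array values are hypothetical stand-ins for the real data, and io.StringIO is used only to avoid touching the disk:

```python
import io
import numpy as np

# Hypothetical 3-row, 6-column array standing in for the real data.
data = np.array([
    [4814, 2464, 27, 0.000628, 0, 117923.0],
    [4211, 736, 2, 4.649688, 5, 2576.0],
    [2075, 1339, 30, 0.000697, 0, 499822.0],
])

# Write every row with the same six-field format string.
buffer = io.StringIO()
np.savetxt(buffer, data, fmt="%d %d %d %.6f %d %.1f")

# Count the whitespace-separated fields on each written line.
lines = buffer.getvalue().splitlines()
field_counts = [len(line.split()) for line in lines]

# Because every line has the same number of fields, loadtxt can read it back.
buffer.seek(0)
reloaded = np.loadtxt(buffer)
```

Every line written this way has exactly six fields, which is what makes the file loadable again without special handling.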
If you want to write it yourself, you could do something like:
fmt = "%d %d %d %.6f %d %.1f"
with open("input.txt", "w") as f:
    for row in data:
        # % formatting needs a tuple, not a numpy row
        f.write(fmt % tuple(row) + "\n")
If the lines with 5 columns instead of 6 are what you really want to write, then use another delimiter such as a comma. This way,
4814,2464,27,0.000627707861971,,117923.0
clearly contains 6 columns.
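To see why the comma makes the missing value explicit, here is a small sketch that splits the same record both ways. The variable names are illustrative only:

```python
# The same record written with whitespace and with commas.
ragged = "4814 2464 27 0.000627707861971 117923.0"
explicit = "4814,2464,27,0.000627707861971,,117923.0"

# Splitting on whitespace silently loses the empty field.
ragged_fields = ragged.split()

# Splitting on commas keeps it as an empty string at index 4.
explicit_fields = explicit.split(",")
```

Here ragged_fields has 5 entries, while explicit_fields has 6, with an empty string marking the missing value.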
What I call valid data is consistent data: data that always contains the same number of columns.
You should really use numpy.loadtxt or numpy.genfromtxt (the latter is the one to use if data are missing). Note that you can specify a delimiter for both of them using the delimiter argument.
data = numpy.loadtxt("valid_input.txt")
col = data[:,2]
or equivalently you could use the usecols argument together with the unpack one.
For your data, the usecols method works if you select only the third column (column 2 in Python terms), provided nothing else is wrong before column 2 elsewhere in the file.
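The two reading strategies mentioned above can be sketched roughly as follows. The sample text is taken from the question's file, io.StringIO stands in for a real file, and the comma-separated variant is an assumed rewrite of the first two rows:

```python
import io
import numpy as np

# Assumed comma-separated rewrite of two rows; the empty field is a missing value.
comma_text = (
    "4814,2464,27,0.000627707861971,,117923.0\n"
    "4211,736,2,4.64968786645,05,2576.0\n"
)
# genfromtxt fills the missing field with nan instead of failing.
table = np.genfromtxt(io.StringIO(comma_text), delimiter=",")

# First rows of the original whitespace-separated file (5 or 6 fields per row).
ragged_text = (
    "4814 2464 27 0.000627707861971 117923.0\n"
    "4211 736 2 4.64968786645 05 2576.0\n"
    "2075 1339 30 0.000697453179968 499822.0\n"
    "2441 2381 3 6.97453179968 05 1968.0\n"
)
# Reading only column 2 sidesteps the inconsistent columns that come after it.
third_col = np.loadtxt(io.StringIO(ragged_text), usecols=(2,))
```

The usecols route only works here because the inconsistency appears after the column of interest. Fixing the file itself remains the more robust option.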
You could also do it by hand, which brings us to the second problem: your loop.
Here, you just overwrite the variable data with a single value (the one in cols[2]) on every iteration:
with open('input.txt', 'r') as f:
    for rows in f:
        cols = rows.split()
        data = cols[2]
There you try to sort a single value:
sorted_data = np.sort(data)
There you want to get the length of a single value:
cdf = np.arange(len(data))/float(len(data))
plt.plot(sorted_data, cdf, '-bs')
plt.show()
I'm really surprised numpy does not complain.
You are getting one row at a time: you need to store these values somewhere (in a list, for instance) and only then sort them and compute the CDF.
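The accumulation step could look roughly like this. It is only a sketch: io.StringIO stands in for open('input.txt'), the rows are the first four from the question, and the CDF formula is the one already used in the question:

```python
import io
import numpy as np

# Stand-in for open('input.txt', 'r'): the first rows of the question's file.
sample = io.StringIO(
    "4814 2464 27 0.000627707861971 117923.0\n"
    "4211 736 2 4.64968786645 05 2576.0\n"
    "2075 1339 30 0.000697453179968 499822.0\n"
    "2441 2381 3 6.97453179968 05 1968.0\n"
)

values = []
for line in sample:
    cols = line.split()
    if len(cols) < 3:
        continue  # skip blank or truncated lines
    # Convert the string to a number and keep it, instead of overwriting.
    values.append(float(cols[2]))

# Sort and build the CDF once, after the loop, from the whole column.
sorted_data = np.sort(values)
cdf = np.arange(len(sorted_data)) / float(len(sorted_data))
```

Since sorted_data and cdf are built from the same list, they have the same length, which is what plt.plot(sorted_data, cdf, '-bs') requires.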
numpy.loadtxt can't load your data (it tries to load everything by default) because it can't infer what you want it to do with rows that have 5 or 6 columns depending on the row. So the only thing it can do is fail.
First, don't get offended: what I'm going to say is meant to help you improve. I'm not judging you in any way, just showing you how you should react when faced with this kind of error, trivial or not.
The problem is that you seem to have copy-pasted the errors without actually looking at them, and so without trying to understand them (but I may be wrong, I'm not in your head :)).
What is certain is that you have not pasted them into your favorite search engine, because answers are plentiful. Again, I may be wrong. Maybe you did, but did not see how these answers applied to your case. Still, the first Google result for
ValueError: x and y must have same first dimension
is pretty explicit. You don't even have to mention that this is matplotlib or Python. Then you would have discovered that sorted_data is not the same length as cdf. With a little more work, you would have figured out what I said before about your implementations.
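A quick way to see the mismatch is to compare the two lengths for a single value read from the file. This is only a sketch: the use of np.atleast_1d is an assumption about roughly how a plotting call treats a scalar input, not a description of matplotlib's actual internals:

```python
import numpy as np

# One value as read from the file: a string, not a number.
data = "27"

# len() of a string counts characters, so the "cdf" gets 2 entries.
cdf = np.arange(len(data)) / float(len(data))

# Treated as an array, the same string is a single element.
as_array = np.atleast_1d(data)
```

One array has 2 entries and the other has 1, which matches the "x and y must have same first dimension" message.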
As you've seen, I've not given a "canonical answer", and I won't, since I consider that you have not done your part of the job. But you can still do it: I've given you all the tools you need to answer your own question. That doesn't mean you have to do it all alone on a remote island: I've almost given a complete answer (really), and the docs and Google can help too :). All you have to do is search a tiny bit for it. Once you have something working, edit your question (or answer your own question).
Upvotes: 4