Fernando Garrido Vaz

Reputation: 45

Scatter plot in pylab: arranging axis and data

I'm trying to put together a scatter plot in pylab and so far have failed miserably. I'm not a programmer as such, so please bear with me.

I have a data set composed of two columns of data contained in a csv file with around 60k lines. Here's a sample:

100000000012640,0.888888888888889
100000000105442,0.777777777777778
100000000206866,1.0
100000000304930,0.777777777777778
100000000583236,0.888888888888889
100000000683528,0.777777777777778
718435316,1.0
718494043,0.777777777777778
718602951,0.777777777777778
718660499,0.777777777777778
718766852,1.0
718795104,1.0
718862926,0.777777777777778
718927526,0.777777777777778
718952836,1.0
719102865,0.777777777777778
719156726,1.0
719213511,1.0
719425334,1.0
719452158,1.0
719493947,0.777777777777778
719566609,1.0
720090346,0.777777777777778
720127760,0.777777777777778
720143948,0.944444444444444
720221566,1.0
720256688,0.944444444444444
720349817,0.777777777777778
720380601,0.777777777777778
720446322,1.0
720524740,1.0
720560353,1.0
720594066,0.777777777777778
720673388,1.0
720716865,0.777777777777778
720730249,1.0
720774433,1.0

My goal is to draw a scatter plot of this data, with the first column on the x axis and the second column on the y axis. The values for the x axis are sorted in descending order, start at the values shown and end at 999963505. The values for the y axis are always between 0 and 1.

Here's what I've tried (using "ipython --pylab"):

data = loadtxt('./data/OD-4322/facebookID.csv', unpack=True, dtype=('float', 'float'), delimiter=',')
scatter(data[0],data[1])

This gets me something that resembles a scatter plot, but not quite what I'm looking for:

http://content.screencast.com/users/FernandoGarridoVaz/folders/Jing/media/a0df81c5-2dbb-4e93-8e18-3c9db07728f5/00000793.png

(I would post the image directly, but my reputation on the site does not allow it yet.)

How can I make the x axis cover the same range as my values? Why are the points in my plot all piled up at 0 and 1, when in truth they are distributed all over the place between 0 and 1?

Upvotes: 1

Views: 730

Answers (1)

Schuh

Reputation: 1095

Pylab uses numpy; you can look up the data types it provides in the numpy documentation. The numbers in your first column are very large integers, so you don't need double-precision floats for them but a large integer type. Look at what happens with the example data you pasted:

>>> x = np.loadtxt('./temp.dat', unpack=True, dtype='float', delimiter=',')[0]
>>> x
array([  1.00000000e+14,   1.00000000e+14,   1.00000000e+14,
         1.00000000e+14,   1.00000001e+14,   1.00000001e+14])
>>> x = np.loadtxt('./temp.dat', unpack=True, dtype='uint64', delimiter=',')[0]
>>> x
array([100000000012640, 100000000105442, 100000000206866, 100000000304930,
       100000000583236, 100000000683528], dtype=uint64)
>>> y = np.loadtxt('./temp.dat', unpack=True, dtype='float', delimiter=',')[1]
>>> scatter(x, y)

Note that what you do in your line scatter(data[0],data[1]) is done here with the [0] and [1] right after each loadtxt() call, which pick out the two columns. The first call shows what your data looks like after being read in as float. Using the first column read in as 'uint64' will help you with your scatter plot.
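As a complement to the session above, here is a minimal sketch of reading both columns in one loadtxt() call by giving each column its own type through a structured dtype. The field names "x" and "y" and the in-memory sample are assumptions made for illustration; with your real data you would pass the file path instead.

```python
import io

import numpy as np

# A few lines from the question's sample, standing in for the real CSV file.
sample = io.StringIO(
    "100000000012640,0.888888888888889\n"
    "100000000105442,0.777777777777778\n"
    "718435316,1.0\n"
)

# One type per column; the field names "x" and "y" are arbitrary labels.
columns = np.dtype([("x", np.uint64), ("y", np.float64)])
data = np.loadtxt(sample, delimiter=",", dtype=columns)

print(data["x"])  # IDs kept as unsigned 64-bit integers
print(data["y"])  # scores kept as floats
```

In a pylab session, scatter(data["x"], data["y"]) would then plot the two fields against each other.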

A good place to start from: the matplotlib gallery.

Edit, to answer your comment: this gives you more control over how the input data is read:

# np, scatter and plot are available in an "ipython --pylab" session
# create Python lists to store the data
x_vals = []
y_vals = []
# open the file and go through it line by line
with open("./temp.dat", "r") as f:
    for line in f:
        # strip() removes the trailing "\n" and other surrounding whitespace
        # split(",") splits the line into a list of (here: 2) substrings
        x, y = line.strip().split(",")
        # convert each value to the right type and append it to its list
        x_vals.append(np.uint64(x))
        y_vals.append(np.float64(y))

scatter(x_vals, y_vals)
# or just plot the data as points using:
plot(x_vals, y_vals, "o")

Your data spans a very wide range between its minimum and maximum x values, so you will get better results if you divide the set into the small and the large numbers and plot them separately.
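One way to do that split could look like the sketch below. The cut-off of 10**12 is an assumption based on the sample in the question (IDs around 7e8 versus IDs around 1e14), and the helper names read_pairs, split_by_magnitude and plot_groups are made up for this example. The plotting function is only defined here, not called, and it imports matplotlib on demand.

```python
import csv
import io

# Hypothetical cut-off between the two ID groups; adjust it to your data.
THRESHOLD = 10**12


def read_pairs(fileobj):
    """Parse 'id,score' lines into (int, float) tuples, skipping blank lines."""
    return [(int(row[0]), float(row[1])) for row in csv.reader(fileobj) if row]


def split_by_magnitude(pairs, threshold=THRESHOLD):
    """Return (small, large): pairs with x below the threshold, and the rest."""
    small = [p for p in pairs if p[0] < threshold]
    large = [p for p in pairs if p[0] >= threshold]
    return small, large


def plot_groups(small, large):
    """Draw each group in its own panel so each gets its own x range."""
    import matplotlib.pyplot as plt

    fig, (ax_small, ax_large) = plt.subplots(1, 2, sharey=True)
    ax_small.scatter([x for x, _ in small], [y for _, y in small])
    ax_small.set_title("small IDs")
    ax_large.scatter([x for x, _ in large], [y for _, y in large])
    ax_large.set_title("large IDs")
    plt.show()


# A few lines from the question's sample, standing in for the real CSV file.
sample = io.StringIO(
    "100000000012640,0.888888888888889\n"
    "100000000105442,0.777777777777778\n"
    "718435316,1.0\n"
    "718494043,0.777777777777778\n"
)
small, large = split_by_magnitude(read_pairs(sample))
print(len(small), "small IDs,", len(large), "large IDs")
```

Calling plot_groups(small, large) then shows the two groups side by side, each with an x axis that fits its own values, while the y axis is shared.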

Upvotes: 1
