
Reputation: 31

Calculating and Plotting the Average of every (X) items in a list of (Y) total

I searched for four days before posting this, so I apologize in advance if it is too elementary and a waste of your time. I have successfully generated some basic plots with pyplot and matplotlib by following their tutorial examples, but I have not been able to adapt them to what I need to accomplish.

Essentially:

I have a list of numbers that exist in a single file.
Each line contains a number corresponding to the number of milliseconds that it takes to complete a certain repeated task.
There are over a million entries in this file, and it can grow beyond that.

Example of 20:

173 1685 1152 253 1623 390 84 40 319 86 54 991 1012 721 3074 4227 4927 181 4856 1415

Eventually I need to calculate the average of each group of consecutive entries (with the groups distributed evenly over the total number of entries) and then plot those averages using any of the plotting libraries for Python. I have considered using pyplot for ease of use.

The X axis will correspond to the total number of tasks completed, and the Y axis will represent the number of milliseconds it takes to complete the task (for this example, the average time over every 5 entries).

i.e.:

Entries 1-5 = (plottedTotalA)
Entries 6-10 = (plottedTotalB)
Entries 11-15 = (plottedTotalC)
Entries 16-20 = (plottedTotalD)

From what I can tell, I don't need to indefinitely store the values of the variables, only pass them as they are processed (in order) to the plotter. I have tried the following example to sum a range of 5 entries from the above list of 20 (which works), but I don't know how to dynamically pass the 5 at a time until completion, all the while retaining the calculated averages which will ultimately be passed to pyplot.

ex:

Python 2.7.3 (default, Jul 24 2012, 10:05:38) 
[GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> plottedTotalA = ['173', '1685', '1152', '253', '1623']
>>> sum(float(t) for t in plottedTotalA)
4886.0

Upvotes: 3

Answers (2)

Ben K.

Reputation: 1150

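Let's assume you have your n values in a list called x. Reshape x into an array A with 5 columns (one row per group of 5 consecutive values) and calculate the mean of each row. Then you can simply plot the resulting vector.

import numpy as np
import matplotlib.pyplot as plt

x = np.array(x, dtype=float)
n = x.size
# Drop any leftover entries, then put each group of 5 consecutive values in its own row.
A = x[:(n // 5) * 5].reshape(-1, 5)
y = A.mean(axis=1)  # one average per row
plt.plot(y)
plt.show()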
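As a quick check of the reshape idea, here is a minimal sketch run on the 20 sample values from the question. The helper name block_means and the tasks_done array are my own illustrative names, not part of NumPy or of the answer above; the sketch assumes that leftover entries which do not fill a whole block can simply be dropped.

```python
import numpy as np

def block_means(values, size=5):
    """Mean of each consecutive block of `size` values (leftovers are dropped)."""
    x = np.asarray(values, dtype=float)
    usable = (x.size // size) * size       # largest multiple of `size` that fits
    blocks = x[:usable].reshape(-1, size)  # one row per block of consecutive values
    return blocks.mean(axis=1)             # average across each row

# The 20 sample timings (milliseconds) from the question.
sample = [173, 1685, 1152, 253, 1623, 390, 84, 40, 319, 86,
          54, 991, 1012, 721, 3074, 4227, 4927, 181, 4856, 1415]

y = block_means(sample, size=5)
# X positions: number of tasks completed at the end of each block (5, 10, 15, 20).
tasks_done = np.arange(1, y.size + 1) * 5
print(tasks_done, y)
```

The first average is 4886 / 5 = 977.2, which agrees with the sum computed in the question. Passing tasks_done and y to pyplot's plot function would put the task count on the X axis and the average time on the Y axis.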

EDIT: changed my code according to tacaswell's comment

However, you might run into memory problems if the file grows far beyond a million entries (a million float64 values occupy only about 8 MB). You could also reuse the name x instead of introducing A and y. This way you would overwrite the reference to the initial data and save some memory.
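If memory ever does become a concern, one alternative is to avoid holding the raw values at all and keep only the averages. The following is a minimal sketch in Python 3 using only the standard library; the name chunk_averages is hypothetical, and the sketch assumes one number per line, skips blank lines, and drops a trailing partial group.

```python
import io

def chunk_averages(lines, size=5):
    """Yield the mean of each consecutive group of `size` numeric lines."""
    total = 0.0
    count = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue                  # ignore blank lines
        total += float(text)
        count += 1
        if count == size:
            yield total / size        # emit one average per full group
            total, count = 0.0, 0     # start the next group
    # A trailing group with fewer than `size` values is dropped here.

# Stand-in for the real file: the 20 sample timings, one per line.
sample_values = [173, 1685, 1152, 253, 1623, 390, 84, 40, 319, 86,
                 54, 991, 1012, 721, 3074, 4227, 4927, 181, 4856, 1415]
sample_file = io.StringIO("\n".join(str(v) for v in sample_values))

averages = list(chunk_averages(sample_file, size=5))
print(averages)
```

With a real file, the same generator can be fed an open file object in place of sample_file, so only the running sum and the list of averages stay in memory.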

I hope this helps

Upvotes: 4

sotapme

Reputation: 4903

I've taken the problem to be how to get 5 items from a list that's generated from a file. As you said:

I don't know how to dynamically pass the 5 at a time until completion,

I've used /dev/random because it is never-ending and random, so it simulates your big file and shows how to process a big file without reading it into a list or otherwise slurping the data.

import itertools
import struct

################################################################################
def bigfile():
    """Never ending list of random numbers"""
    # Binary mode: each value is read as a 2-byte unsigned short.
    with open('/dev/random', 'rb') as f:
        while True:
            yield struct.unpack("H", f.read(2))[0]
################################################################################
def avg(l):
    """Noddy version"""
    # float() avoids integer division under Python 2.
    return float(sum(l)) / len(l)
################################################################################

bigfile_i = bigfile()

## Grouper recipe @ itertools
by_5 = itertools.imap(None, *[iter(bigfile_i)]*5)

# Only take 5, 10 times.
for x in range(10):
    l = by_5.next()
    a = avg(l)
    print l, a ## PLOT ?
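For readers on Python 3, where itertools.imap no longer exists and the built-in zip plays the same role, the grouping idiom can be sketched as follows. The names fake_bigfile and grouped are hypothetical, and a seeded random generator stands in for /dev/random so the sketch runs on any platform.

```python
import itertools
import random

def fake_bigfile(seed=0):
    """Endless stream of 16-bit random integers (stand-in for the big file)."""
    rng = random.Random(seed)
    while True:
        yield rng.randrange(65536)

def grouped(iterable, size):
    """Yield tuples of `size` consecutive items; an incomplete tail is dropped."""
    # The same iterator object repeated `size` times, so zip pulls
    # `size` consecutive items for every tuple it builds.
    return zip(*[iter(iterable)] * size)

# Take 10 groups of 5 and keep only their averages.
averages = [sum(group) / len(group)
            for group in itertools.islice(grouped(fake_bigfile(), 5), 10)]
print(averages)
```

Because only the list of averages is retained, that list is what would eventually be handed to pyplot.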

EDIT

Detail of what happens to the remainder.

If we pretend the file has 11 lines and we take 5 each time:

In [591]: list(itertools.izip_longest(*[iter(range(11))]*5))
Out[591]: [(0, 1, 2, 3, 4), (5, 6, 7, 8, 9), (10, None, None, None, None)]

In [592]: list(itertools.imap(None, *[iter(range(11))]*5))
Out[592]: [(0, 1, 2, 3, 4), (5, 6, 7, 8, 9)]

In [593]: list(itertools.izip(*[iter(range(11))]*5))
Out[593]: [(0, 1, 2, 3, 4), (5, 6, 7, 8, 9)]

izip_longest fills the remainder with None, whereas imap and izip truncate. I can imagine the OP may want to use itertools.izip_longest(*iterables[, fillvalue]) for its optional fill value, although None is a good sentinel for missing values.

I hope that makes it clear what happens to the remainder.
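If the leftover entries should still contribute an average of their own, one option is to pad with a sentinel and filter it out before averaging. This is a minimal Python 3 sketch (itertools.zip_longest is the Python 3 name for izip_longest); the function name group_averages is hypothetical, and the sketch assumes the data itself never contains None.

```python
from itertools import zip_longest

def group_averages(values, size):
    """Average consecutive groups of `size`; the last group may be shorter."""
    groups = zip_longest(*[iter(values)] * size, fillvalue=None)
    for group in groups:
        # Discard the None padding added to the final, incomplete group.
        present = [v for v in group if v is not None]
        yield sum(present) / len(present)

result = list(group_averages(range(11), 5))
print(result)  # [2.0, 7.0, 10.0]
```

Here the last average covers a single entry (10), so whether a short final group belongs on the plot is a judgment call.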

Upvotes: 1
