Todd Huffman

Reputation: 21

Trouble obtaining data from multiple *.root files... but no problems using only one

I am using Python version 3.6.5 and have a jagged TTree with a multi-dimensional structure. The data is spread over more than 1000 files, all with the identical TTree structure.

Suppose, though, that I have just two files; I'll call them fname1.root and fname2.root.

The following code has no problem opening either of these by itself:

import uproot as upr
import awkward
import boost_histogram as bh
import math
import matplotlib.pyplot as plt
#
# define a plotting program
# def plotter(h)
#
# preparing the file location for files
pth = '/fullpathName/'
fname1 = 'File755.root'
fname2 = 'File756.root'
fileList = [pth+fname1, pth+fname2]
#
# print out the path and filename that I've made to show the user
for file in fileList:
    print(file)
print('\n')
#
# Let's make a histogram. This one has 50 bins, starts at zero, and ends at 1000.0.
# It will be a histogram of jet pTs.
jhist = bh.histogram(bh.axis.regular(50,0.0,1000.0))
#
#show what you've just done
print(jhist)
#
# does not work, only fills first file!
for chunk in upr.iterate(fileList,"bTag_AntiKt4EMTopoJets",["jet_pt"]):
    jhist.fill(chunk[b"jet_pt"][:, :2].flatten()*0.001)
#
#
# what does my histogram look like?
ptHist = plt.bar(jhist.axes[0].centers, jhist.view(), width=jhist.axes[0].widths)
plt.show()

As I said, the above code works if I put only ONE file in 'fileList'.

The naive thing to do doesn't work. If I create a list of files using

files = [pth+fname1, pth+fname2]

and re-run that code, I get the following error, which is much the same error I have been getting all along.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 48, in <module>
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 116, in iterate
    for tree, branchesinterp, globalentrystart, thispath, thisfile in _iterate(path, treepath, branches, awkward, localsource, xrootdsource, httpsource, **options):
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 163, in _iterate
    file = uproot.rootio.open(path, localsource=localsource, xrootdsource=xrootdsource, httpsource=httpsource, **options)
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 54, in open
    return ROOTDirectory.read(openfcn(path), **options)
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 51, in <lambda>
    openfcn = lambda path: MemmapSource(path, **kwargs)
  File "/home/huffman/.local/lib/python3.6/site-packages/uproot/source/memmap.py", line 21, in __init__
    self._source = numpy.memmap(self.path, dtype=numpy.uint8, mode="r")
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_94python3/x86_64-slc6-gcc8-opt/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 12] Cannot allocate memory

Upvotes: 1

Views: 1319

Answers (1)

Jim Pivarski

Reputation: 5974

Lazy arrays are just a convenience interface: you can transform the whole dataset with one function call, rather than iterating over chunks in an explicit loop. Internally, lazy arrays contain an implicit loop over chunks, so if you're running out of memory one way, you will run out of memory the other way, too.

Your problem is not a failure to close files (they're memory-mapped, so "closing" doesn't have a clear meaning: they're a view into memory that the operating system is allocating for itself, anyway); your problem is that arrays are not being deleted. That's the only thing that can use up all the memory on your computer.

There are a few things you can do here: one is to

for chunk in uproot.iterate(files, "bTag_AntiKt4EMTopoJets", ["jet_pt", "jet_eta"]):
    # fill with chunk[b"jet_pt"] and chunk[b"jet_eta"], which correspond
    # to the same sets of events, one-to-one.

to explicitly loop over the chunks ("explicit" because you see and control the loop here, and because you have to specify which branches you want to load into the dict chunk). You can control the size of the chunks with the entrysteps parameter. The other is to
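The memory behavior of that explicit loop can be sketched without any ROOT files at all. Below, a plain generator stands in for uproot.iterate and the jet-pT values are synthetic; the point is only that one chunk at a time is resident, and the histogram counts are the only thing that accumulates:

```python
import numpy as np

# A sketch of chunked iteration: this generator stands in for uproot.iterate,
# yielding one chunk of made-up jet-pT values (in GeV) at a time, so only one
# chunk is ever held in memory.
def fake_iterate(n_chunks=4, chunk_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        yield rng.exponential(scale=100.0, size=chunk_size)

bins = np.linspace(0.0, 1000.0, 51)   # 50 bins from 0 to 1000, as in the question
counts = np.zeros(50, dtype=np.int64)

for chunk in fake_iterate():
    c, _ = np.histogram(chunk, bins=bins)
    counts += c                        # accumulate counts; the chunk is then freed

print(counts.sum())                    # 4 chunks x 1000 entries, minus overflow above 1000
```

Only the 50-bin counts array survives the loop, so the peak memory is one chunk plus the histogram, no matter how many files are in the list.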

cache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(files, "bTag_AntiKt4EMTopoJets", cache=cache)

to keep the loop implicit. The ArrayCache will throw out chunks of arrays, so that they have to be loaded again, if it gets to the 1 GB limit. If you make that limit too small, it won't be able to hold one chunk, but if you make it too large, you'll run out of memory.
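The eviction behavior described above can be illustrated with a toy byte-limited cache (this is a hypothetical sketch of the principle, not uproot's actual ArrayCache implementation): once the total size of the stored arrays exceeds the limit, the least recently used entries are dropped and would have to be re-read.

```python
from collections import OrderedDict
import numpy as np

# A toy byte-limited, least-recently-used cache, sketching how a cache like
# uproot's ArrayCache keeps memory bounded by evicting old array chunks.
class ToyArrayCache:
    def __init__(self, limitbytes):
        self.limitbytes = limitbytes
        self.data = OrderedDict()

    def __setitem__(self, key, array):
        self.data[key] = array
        self.data.move_to_end(key)                 # newest entry goes to the end
        while sum(a.nbytes for a in self.data.values()) > self.limitbytes:
            self.data.popitem(last=False)          # evict the least recently used

    def __getitem__(self, key):
        self.data.move_to_end(key)                 # a hit marks the entry as fresh
        return self.data[key]

cache = ToyArrayCache(limitbytes=3 * 8000)         # room for three 1000-float64 chunks
for i in range(5):
    cache[i] = np.zeros(1000)                      # each chunk is 8000 bytes

print(sorted(cache.data))                          # only the newest chunks survive: [2, 3, 4]
```

If the limit were smaller than one chunk, every access would evict and reload, which is the "too small" failure mode described above.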

By the way, although you're reporting a memory issue, there's another major performance issue with your code: you're looping over events in Python. Instead of

events.jet_pt[i][:2]*0.001

to get the jet pT for event i, do

events.jet_pt[:, :2]*0.001

for the jet pT of all events as a single array. You might then need to .flatten() that array to fit the histogram's fill method.

Upvotes: 1
