Efficiently creating lots of Histograms from grouped data held in pandas dataframe

Question

I want to create a bunch of histograms from grouped data in a pandas DataFrame. Here's a link to a similar question. To generate toy data that is very similar to what I am working with, you can use the following code:

    from pandas import DataFrame
    import numpy as np
    x = ['A']*300 + ['B']*400 + ['C']*300
    y = np.random.randn(1000)
    df = DataFrame({'Letter':x, 'N':y})

I want to put those histograms (read: the binned data) in a new DataFrame and save that for later processing. Here's the real kicker: my file is 6 GB, with 400k+ groups and just 2 columns.

I've thought about using a simple for loop to do the work:

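    data = []
    for group in df['Letter'].unique():
        # Select this group's values; [0] keeps the densities and drops the bin edges
        values = df[df['Letter'] == group]['N']
        hist = np.histogram(values, range=(-2000, 2000), bins=50, density=True)[0]
        data.append(hist)
    df2 = DataFrame(data)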

Note that the bins, range, and density keywords are all necessary for my purposes, so that the histograms are consistent and normalized across the rows of my new DataFrame df2 (the parameter values are from my real dataset, so they're overkill on the toy dataset). The for loop works great: on the toy dataset it generates a pandas DataFrame of 3 rows and 50 columns, as expected. On my real dataset, I've estimated that the code would take around 9 days to complete. Is there a better/faster way to do what I'm looking for?

P.S. I've thought about multiprocessing, but I think the overhead of creating processes and slicing data would make it slower than just running this serially (I may be wrong and wouldn't mind being corrected on this one).

Ami Tavory · Accepted Answer

For the type of problem you describe here, I personally usually do the following, which is basically to delegate the whole thing to multithreaded Cython/C++. It's a bit of work, but not impossible, and I'm not sure there's really a viable alternative at the moment.

Here are the building blocks:

1. First, your df['Letter'].values and df['N'].values are just numpy arrays. This link shows how to get C-pointers from such arrays.
2. Now that you have pointers, you can write a true multithreaded program using Cython's prange, forgoing any Python from this point (you're now in C++ territory). So say you have k threads scanning your 6 GB arrays, and thread i handles groups whose keys have a hash that is i modulo k.
3. For a C program (which is what your code really is now), the GNU Scientific Library has a nice histogram module.
4. When the prange is done, you need to convert the C++ structures back to numpy arrays, and from there back to a DataFrame. Wrap the whole thing up in Cython, and use it like a normal Python function.
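
The core idea in the steps above is to scan the raw arrays once and accumulate per-group bin counts, instead of filtering the DataFrame once per group. The sketch below illustrates that single-pass accumulation in plain NumPy. It is a single-threaded stand-in, not the Cython/C++ implementation described in this answer. The function name grouped_histograms and its parameters are hypothetical. It assumes equal-width bins, returns all-zero rows for groups with no values inside the range (np.histogram would give NaN there), and may place a value that lies within floating-point rounding of a bin edge in the neighbouring bin compared with np.histogram.

```python
import numpy as np
import pandas as pd


def grouped_histograms(df, key="Letter", value="N", bins=50, value_range=(-2000, 2000)):
    """Per-group density histograms from one pass over the data (hypothetical helper)."""
    lo, hi = value_range

    # Integer code per row (0..n_groups-1); labels are in order of first appearance.
    codes, labels = pd.factorize(df[key])
    values = df[value].to_numpy(dtype=float)

    # Drop rows with a missing key (code -1) or a value outside the range.
    inside = (codes >= 0) & (values >= lo) & (values <= hi)
    codes, values = codes[inside], values[inside]

    # Bin index per value; the right edge is folded into the last bin.
    width = (hi - lo) / bins
    idx = np.minimum(((values - lo) / width).astype(np.int64), bins - 1)

    # Count every (group, bin) pair at once through a flattened index.
    counts = np.bincount(codes * bins + idx, minlength=len(labels) * bins)
    counts = counts.reshape(len(labels), bins).astype(float)

    # Normalise like density=True: counts / (in-range group total * bin width).
    totals = counts.sum(axis=1, keepdims=True)
    density = np.divide(counts, totals * width,
                        out=np.zeros_like(counts), where=totals > 0)
    return pd.DataFrame(density, index=labels)
```

On the toy data from the question, df2 = grouped_histograms(df) should give a DataFrame of 3 rows and 50 columns indexed by letter. With 400k groups and 50 bins, the count array holds 20 million float64 values, roughly 160 MB, so whether this is fast enough on the real 6 GB file would need to be measured.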
