jeffs

Reputation: 321

Splitting data into subsamples

I have a huge dataset containing particle coordinates. To split the data into training and test sets, I want to divide the space into many subspaces. I did this with a for loop in every direction (x, y, z), but the code takes very long to run and is not efficient enough, especially for large datasets:

particle_boxes = []

init = 0
final = 50
number_box = 5

for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            # mask for particles whose coordinates fall inside box (i, j, k)
            index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                              df_particles['Y'].between(init+j*final, final+final*j) &
                              df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])

where init and final define the box size and df_particles contains every particle's coordinates (x, y, z).

After running this, particle_boxes contains 125 (number_box^3) equally spaced subboxes.

Is there any way to write this code more efficiently?

Upvotes: 1

Views: 486

Answers (3)

P.S.K

Reputation: 101

Have a look at the train_test_split function available in the scikit-learn library.

I think it provides almost exactly the kind of functionality you need.

The source code is available on GitHub.
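
For reference, a minimal sketch of how it could be applied to the question's DataFrame (the 0.2 test fraction and random_state are arbitrary choices, not from the question):

from sklearn.model_selection import train_test_split

# randomly split the particle DataFrame into 80% train / 20% test
train_df, test_df = train_test_split(df_particles, test_size=0.2, random_state=42)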

Upvotes: 1

piRSquared

Reputation: 294258

Note on efficiency

I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.

I'm curious to see if anyone else comes up with something an order of magnitude faster.

Sample data

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)

Solution

Construct an array a that represents your boundaries

a = np.array([50, 100, 150, 200, 250])
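
If you prefer not to hard-code the boundaries, the same array can be derived from the question's parameters (a sketch assuming init = 0 and a box width of final = 50, as in the question):

a = np.arange(final, final * number_box + 1, final)  # array([ 50, 100, 150, 200, 250])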

Then use searchsorted to create the individual dimensional bins

x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
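
To make the binning explicit: with its default side='left', searchsorted returns, for each coordinate, the index of the first boundary that is greater than or equal to it, which is exactly the box index along that axis. A few hypothetical values:

a.searchsorted(49)    # 0 -> first box  (values up to 50)
a.searchsorted(51)    # 1 -> second box (51 through 100)
a.searchsorted(249)   # 4 -> last box   (201 through 249)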

Use groupby on the three bins. I used a bit of trickery to get the result into a dict:

g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
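
If the unpacking trick reads as too magical, an equivalent dict comprehension builds the same mapping:

g = {key: group for key, group in df_particles.groupby([x_bin, y_bin, z_bin])}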

We can see the first zone

g[(0, 0, 0)]

      X   Y   Z
30    2  36  47
194   0  34  45
276  46  37  34
364  10  16  21
378   4  15   4
429  12  34  13
645  36  17   5
743  18  36  13
876  46  11  34

and the last

g[(4, 4, 4)]

       X    Y    Z
87   223  236  213
125  206  241  249
174  218  247  221
234  222  204  237
298  208  211  225
461  234  204  238
596  209  229  241
731  210  220  242
761  225  215  231
762  206  241  240
840  211  241  238
846  212  242  241
899  249  203  228
970  214  217  232
981  236  216  248
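
Since the original goal was a train/test split, one way to use these boxes is to hold out a random subset of them. A minimal sketch (the 80/20 ratio and seed are my choices, not from the question):

import random

random.seed(0)
keys = list(g)
random.shuffle(keys)
cut = len(keys) // 5                              # hold out ~20% of the boxes
test_df = pd.concat([g[k] for k in keys[:cut]])   # particles in held-out boxes
train_df = pd.concat([g[k] for k in keys[cut:]])  # particles in remaining boxes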

Upvotes: 3

Parfait

Reputation: 107587

Instead of multiple nested for loops, consider one loop using itertools.product. But of course, avoid loops entirely if possible, as @piRSquared shows:

from itertools import product

particle_boxes = []

for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))

    particle_boxes.append(df_particles[index_particle])

Alternatively, with a list comprehension:

def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))

    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k) for i, j, k in product(range(number_box), repeat=3)]

Upvotes: 1
