logic1976

Reputation: 581

How do I take advantage of parallel processing in gcloud?

Google Cloud offers a virtual machine instance with 96 cores.

I thought that with those 96 cores you could divide your program into 95 slices, leave one core to run the main process, and thus run the program 95 times faster.

It's not working out that way however.

I'm running a simple Python program that just counts to 200 million.

On my MacBook it takes 4.6 seconds to run this program serially, and 1.2 seconds when I divide the work into 4 sections and run them in parallel.

My MacBook has the following specs:

macOS 10.14.5
Processor: 2.2 GHz Intel Core i7 (Mid 2015)
Memory: 16 GB 1600 MHz DDR3

The machine that gcloud offers is, per core, just a tad slower.

When I run the program serially it takes about the same amount of time. When I divide the program into 95 divisions it actually takes more time: 2.6 seconds. When I divide it into 4, it takes 1.4 seconds to complete. The machine has the following specs:

n1-highmem-96 (96 vCPUs, 624 GB memory)

I need the high memory because I'm going to index a corpus of 14 billion words. I'm going to save the locations of the 100,000 most frequent words.

When I then save those locations, I'm going to pickle them.

Pickling the files takes an enormous amount of time and will probably consume 90% of the total run time.

For this reason I want to pickle as rarely as possible: if I can keep the locations in Python objects for as long as possible, I will save myself a lot of time.
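To illustrate the batched-pickling idea (the `locations` dict and file name below are made up for illustration, not from my indexer): accumulate everything in one Python object and dump it once, with the highest pickle protocol, rather than pickling repeatedly.

```python
import pickle

# hypothetical structure: word -> list of corpus positions
locations = {
    'the': [0, 14, 92],
    'of': [3, 57],
}

# one batched dump with the highest protocol is much cheaper than
# pickling after every update
with open('locations.pkl', 'wb') as f:
    pickle.dump(locations, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('locations.pkl', 'rb') as f:
    restored = pickle.load(f)
```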

Here is the Python program I'm using, in case it helps:

import os, time

p = print
en = enumerate


def count_2_million(**kwargs):
    p('hey')
    start = kwargs.get("start")
    stop = kwargs.get("stop")
    fork_num = kwargs.get('fork_num')
    for x in range(start, stop):  # busy loop: just count through this slice
        pass
    b = time.time()
    c = round(b - kwargs['time'], 3)
    p(f'done fork {fork_num} in {c} seconds')


def divide_range(divisions: int, total: int, idx: int, begin=0):
    # split [begin, begin + total) into `divisions` chunks; return chunk `idx`
    sec = total // divisions
    start = (idx * sec) + begin
    if total % divisions != 0 and idx == divisions - 1:
        # last chunk absorbs the remainder (idx runs 0..divisions-1,
        # so the original `idx == divisions` test could never fire)
        stop = begin + total
    else:
        stop = start + sec
    return start, stop


def main_fork(func, total, **kwargs):
    forks = 4
    p(f'{forks} forks')
    kwargs['time'] = time.time()
    pids = []
    for i in range(forks):
        start1, stop1 = 0, 0
        if total != -1:
            # pass `forks` here, not a hard-coded 4, so the slicing
            # always matches the number of forked processes
            start1, stop1 = divide_range(forks, total, i)

        # set the child's arguments before forking so the child sees them
        kwargs['start'] = start1
        kwargs['stop'] = stop1
        kwargs['fork_num'] = i

        newpid = os.fork()
        if newpid == 0:
            p(f'fork num {i} {start1} {stop1}')
            child(func, **kwargs)
        pids.append(newpid)

    # wait for every child so the parent doesn't exit before the timings print
    for pid in pids:
        os.waitpid(pid, 0)
    return pids


def child(func, **kwargs):
    func(**kwargs)
    os._exit(0)


main_fork(count_2_million, 200_000_000)

Upvotes: 0

Views: 684

Answers (1)

Stefan Neacsu

Reputation: 693

For your specific use case, I think one solution is to use clusters.

There are two main types of cluster-computing workloads. In your case, since you need to index 14 billion words, you will want High-throughput computing (HTC).

What is High-throughput Computing?

A type of computing where apps have multiple tasks that are processed independently of each other, without any need for the individual compute nodes to communicate. These workloads are sometimes called embarrassingly parallel or batch workloads.
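As a single-machine illustration of an embarrassingly parallel workload (the chunking below is a sketch, not your exact code), Python's standard multiprocessing module runs each chunk in an independent worker process with no communication between workers, which is the same shape an HTC cluster scales out across nodes:

```python
from multiprocessing import Pool

def count_range(bounds):
    # each task is fully independent: no shared state, no communication
    start, stop = bounds
    n = 0
    for _ in range(start, stop):
        n += 1
    return n

if __name__ == '__main__':
    total = 2_000_000
    # split the range into 4 independent chunks
    chunks = [(i * total // 4, (i + 1) * total // 4) for i in range(4)]
    with Pool(4) as pool:
        counts = pool.map(count_range, chunks)
    print(sum(counts))  # 2000000
```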

When I run the program in serial it takes the same amount of time. When I divide the program into 95 division it actually takes more time: 2.6 seconds. When I divide the program into 4, it takes 1.4 seconds to complete.

For this part, check the documentation's recommended architectures and best practices so you can be sure your setup gets the most out of the job you want done.

There is open-source software like ElastiCluster that provides cluster management and support for provisioning nodes on Google Compute Engine.

After the cluster is operational you can use HTCondor to analyze and manage the cluster nodes.
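For example, a minimal HTCondor submit description (all file names here are hypothetical) that queues four independent copies of a job looks like this; HTCondor substitutes `$(Process)` with the job index 0–3, so each job can work on its own chunk:

```
executable = index_chunk.py
arguments  = $(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = index.log
queue 4
```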

Upvotes: 2

Related Questions