Reputation: 97
Is there a memory-efficient way to apply large (>4 GB) models to Spark DataFrames without running into memory issues?
We recently ported a custom pipeline framework over to Spark (using Python and PySpark) and ran into problems when applying large models like Word2Vec and autoencoders to tokenized text inputs. At first I very naively converted the transformation calls to udfs (both pandas and Spark "native" ones), which was fine as long as the models/utilities used were small enough to either be broadcast or instantiated repeatedly:
@pandas_udf("array<string>")
def tokenize_sentence(sentences: pandas.Series):
return sentences.map(lambda sentence: tokenize.word_tokenize(sentence))
Trying the same approach with large models (e.g. for embedding those tokens into vector space via word2vec) resulted in terrible performance, and I get why:
@pandas_udf("array<array<double>>")
def rows_to_lists_of_vectors(rows):
model = api.load('word2vec-google-news-300')
def words_to_vectors(words) -> List[List[float]]:
vectors = []
for word in words:
if word in model:
vec = model[word]
vectors.append(vec.tolist())
return vectors
return rows.map(words_to_vectors)
The code above would instantiate the ~4 GB word2vec model repeatedly, loading it from disk into RAM, which is very slow. I could remedy this by using mapPartitions, which would at least only load it once per partition. But more importantly, this would crash with memory-related issues (at least on my dev machine) if I didn't heavily restrict the number of tasks, which in turn made the small udfs very slow. For example, restricting the number of tasks to 2 would solve the memory crashes, but make the tokenizing painfully slow.
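For concreteness, a simplified sketch of that mapPartitions approach (the "words" column name and the surrounding spark/df variables are placeholders, not our actual pipeline code):

import gensim.downloader as api
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

def embed_partition(rows):
    # Loaded once per partition instead of once per udf instantiation.
    model = api.load('word2vec-google-news-300')
    for row in rows:
        yield ([model[w].tolist() for w in row.words if w in model],)

schema = StructType([StructField("vectors", ArrayType(ArrayType(DoubleType())))])
vectors_df = spark.createDataFrame(df.rdd.mapPartitions(embed_partition), schema)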
I understand there is an entire pipeline framework in Spark that might fit our needs, but before committing to it, I'd like to understand how the problems I ran into are solved there. Maybe there are some key practices we could adopt instead of having to rewrite our framework.
My actual question therefore is twofold:
If any of the above makes you believe I missed a core principle with Spark, please point it out; after all, I'm just getting started with Spark.
Upvotes: 0
Views: 549
Reputation: 2108
This will vary with many factors (models, cluster resources, pipeline), but to try to answer your main question:
1) Spark ML Pipelines might solve your problem if they fit your needs in terms of the Tokenizer, Word2Vec, etc. (a minimal sketch is below). However, those are not as powerful as the ones already available off the shelf and loaded with api.load. You might also want to take a look at Deeplearning4j, which brings these to Java/Apache Spark, and see how it does the same things: tokenization, word2vec, etc.
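Just as an illustration (column names are placeholders, and note that Spark's Word2Vec trains embeddings on your own corpus instead of loading the pretrained Google News vectors):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
word2vec = Word2Vec(vectorSize=300, minCount=5,
                    inputCol="words", outputCol="vector")

pipeline = Pipeline(stages=[tokenizer, word2vec])
model = pipeline.fit(df)        # trains the Word2Vec stage on the tokenized column
result = model.transform(df)    # adds the "words" and "vector" columns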
2) Following your current approach, I would load the model in a foreachPartition or mapPartitions call and make sure the model fits into memory per partition (see the sketch below). You can shrink the partition size to a more affordable number based on the cluster resources to avoid memory problems (it's the same idea as creating one DB connection per partition instead of one per row).
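A minimal sketch of that idea using Spark 3's mapInPandas (the "words" column and the partition count are assumptions you would tune to your data and cluster):

import pandas as pd
import gensim.downloader as api

def embed(batches):
    model = api.load('word2vec-google-news-300')   # loaded once per partition
    for pdf in batches:
        yield pd.DataFrame({
            "words": pdf["words"],
            "vectors": pdf["words"].map(
                lambda ws: [model[w].tolist() for w in ws if w in model]),
        })

num_partitions = 8   # tune so the model plus one batch fits in executor memory
result = (df.select("words")
            .repartition(num_partitions)
            .mapInPandas(embed, schema="words array<string>, vectors array<array<double>>"))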
Typically, Spark udfs work well when you apply business logic that is Spark-friendly, not when you mix in heavy third-party dependencies.
Upvotes: 1