Reputation: 2182
Here is the idea of my code:
I have a large RDD of email data, called email. It contains about 700 million emails and looks like this:
[['value1','value2','value3','value4'],['recipient1','recipient2','recipient3'],['sender']]
There are over 40,000 distinct recipient and sender email addresses in email. I have a list of 600 email addresses I am interested in, shown below:
relevant_emails = ['rel_email1','rel_email2','rel_email3',...,'rel_email600']
I want to go through my large RDD email and keep only those emails where both the sender and at least one of the recipients fall in the list of relevant_emails. So, I broadcast relevant_emails so that each worker node has a copy:
broadcast_emails = sc.broadcast(relevant_emails)
Here is the function that I want to apply to each row in email:
def get_relevant_emails(row):
    r_bool = False
    s_bool = False
    recipients = row[1]
    sender = row[2]
    # sender is a single-element list, e.g. ['sender']
    if sender[0] in broadcast_emails.value:
        s_bool = True
    # match if at least one recipient is in the relevant list
    for x in range(0, len(recipients)):
        if recipients[x] in broadcast_emails.value:
            r_bool = True
            break
    if r_bool is True and s_bool is True:
        return row
The problem I face is that when I run email.map(lambda row: get_relevant_emails(row)) (and then follow it up with something that forces it to execute, such as saveAsTextFile()), it starts to run and then prints this:
WARN: Stage 5 contains a task of very large size (xxxx KB). The maximum recommended task size is 100 KB
Then it stops running. FYI: I am running this in a Spark shell, with 20 executors, 10 GB of memory per executor, and 3 cores per executor. email is 76.7 GB in terms of block storage consumption on HDFS, and I've got it in 600 partitions (76.7 GB / 128 MB).
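For reference, the end-to-end flow I am going for looks roughly like this (just a sketch; the output path is a placeholder, and the filter() step drops the None values that map() returns for non-matching rows):
# Sketch of the intended pipeline (output path is a placeholder).
broadcast_emails = sc.broadcast(relevant_emails)

# get_relevant_emails() returns None for rows that don't match,
# so drop those before writing the result out.
relevant = email.map(get_relevant_emails).filter(lambda row: row is not None)
relevant.saveAsTextFile("hdfs:///tmp/relevant_emails_output")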
Upvotes: 1
Views: 1208
Reputation: 680
The task size that the warning refers to is the size of the serialized task, including the closure of get_relevant_emails(). It grows with the variables the function carries with it, and it can exceed the recommended maximum when the function references large variables defined outside its own scope, since those get serialized into every task.
In any case, I would recommend using the DataFrame API, as it makes this operation simpler and it will perform better. It is faster because it can do all of the heavy lifting on the JVM side and avoids marshalling data back and forth between the Python and Java VMs. My team and I moved much of our existing Python logic into Spark SQL and DataFrames and saw massive performance improvements.
Here's how it could work for your case:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import broadcast, expr

sc = SparkContext()
sql_ctx = SQLContext(sc)

# Toy data with the same shape as your RDD: [values, recipients, sender]
email = [
    [['value1','value2','value3','value4'], ['recipient1','recipient2','recipient3'], ['sender1']],
    [['value1','value2','value3','value4'], ['recipient1','recipient2','recipient3'], ['sender2']],
    [['value1','value2','value3','value4'], ['recipient1','recipient4','recipient5'], ['sender3']]
]
relevant_addresses = [
    ["sender2"],
    ["sender3"],
    ["recipient3"]
]

email_df = sql_ctx.createDataFrame(email, ["values", "recipients", "sender"])
relevant_df = sql_ctx.createDataFrame(relevant_addresses, ["address"])

# Hint Spark to ship the small table to every executor instead of shuffling
broadcasted_relevant = broadcast(relevant_df)

# Left-semi join: keep an email row if any relevant address appears among
# its recipients or as its sender
result = email_df.join(
    broadcasted_relevant,
    on=expr("array_contains(recipients, address) OR array_contains(sender, address)"),
    how="leftsemi"
)
result.collect()
The left-semi join here acts like a filter and only selects the matching rows from email_df. It's the same kind of join that takes place when you use a WHERE ... IN (subquery) clause in SQL.
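To make that equivalence concrete, here is a rough sketch of the same semi-join written in SQL against temporary views ("emails" and "relevant" are just view names I made up for the example; exact syntax and function support depend on your Spark version):
# Sketch only: the same semi-join expressed in SQL. "emails" and "relevant"
# are temporary views registered for this example; array_contains() needs a
# Spark version that exposes it as a SQL function.
email_df.registerTempTable("emails")
relevant_df.registerTempTable("relevant")

result_sql = sql_ctx.sql("""
    SELECT *
    FROM emails e
    LEFT SEMI JOIN relevant r
      ON array_contains(e.recipients, r.address)
      OR array_contains(e.sender, r.address)
""")
result_sql.collect()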
Upvotes: 1