SDG

Reputation: 2342

How do I map this function over pyspark

I might be approaching this completely wrong, but I currently have a function, shown below, that returns the link of the first YouTube video that comes up in the search results for a given string input:

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

def searchYTLink(title):
    # Build the YouTube search URL for the given title
    query = urllib.parse.quote(title)
    url = "https://www.youtube.com/results?search_query=" + query
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    # Grab the first video result and return its full link
    result = soup.findAll(attrs={'class': 'yt-uix-tile-link'})[0]
    return 'https://www.youtube.com' + result['href']

Now I want to input a list of strings to this function and map it over all my worker nodes. To achieve this, I wrote the code below:

from pyspark import SparkContext

# Make sure that you initialize the Spark context
sc = SparkContext(appName="MusicClassifier")
searchTest = ['videoa', 'videob', ...]
sc.parallelize(searchTest).map(searchYTLink)

Is this the right way to do this?

Upvotes: 0

Views: 273

Answers (1)

tel

Reputation: 13997

One tiny thing to fix - you need an action

Your example looks fine, up to a point. To actually execute any code, you'll need to add an action to the end of your chain of RDD methods. The most straightforward action is usually collect, which gathers the results from every partition and returns them to the driver as a single list:

sc.parallelize(searchTest).map(searchYTLink).collect()

Notes

  • You do indeed appear to be using map correctly. The function you pass into it should take exactly one argument, which searchYTLink does.

  • For performance reasons, you may also want to look into mapPartitions(func). mapPartitions works exactly like map, except that func should take an iterator over a whole partition's worth of values at a time and return an iterable of results; see the sketch after this list.
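For illustration, here is a minimal sketch of that variant. It assumes the searchYTLink function from the question is available on the workers, and the wrapper name searchYTLinks is just a hypothetical helper introduced here:

def searchYTLinks(titles):
    # titles is an iterator over all the strings in one partition;
    # yield one link per title so the output lines up with the input
    for title in titles:
        yield searchYTLink(title)

sc.parallelize(searchTest).mapPartitions(searchYTLinks).collect()

This pays the per-task overhead once per partition rather than once per element, which can matter when the per-call setup (network connections, parsers, etc.) is expensive.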

Upvotes: 1
