Reputation: 2342
I might be approaching this completely wrong, but I currently have a function, shown below, that takes a string input and returns the link of the first YouTube video that comes up in the search results:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

def searchYTLink(title):
    # Build the YouTube search URL from the query string
    query = urllib.parse.quote(title)
    url = "https://www.youtube.com/results?search_query=" + query
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    # Take the first result tile and return its absolute URL
    result = soup.findAll(attrs={'class': 'yt-uix-tile-link'})[0]
    return 'https://www.youtube.com' + result['href']
Now I want to feed a list of strings to this function and map it across all my worker nodes. To achieve that, I wrote the code below:
from pyspark import SparkContext

# Make sure that you initialize the Spark Context
sc = SparkContext(appName="MusicClassifier")

searchTest = ['videoa', 'videob', ...]
sc.parallelize(searchTest).map(searchYTLink)
Is this the right way to do this?
Upvotes: 0
Views: 273
Reputation: 13997
Your example looks fine, up to a point. In order to actually execute any code, you'll need to add an action to the end of your chain of RDD methods. The most straightforward action is typically collect, which gathers the elements from every partition and returns them to the driver as a plain list:
sc.parallelize(searchTest).map(searchYTLink).collect()
You do indeed appear to be using map correctly. The function you pass to it should take exactly one argument, which searchYTLink does.
For performance reasons, you may also want to look into mapPartitions(func). mapPartitions is exactly like map, except that func receives an iterator over a whole partition of values at a time and should return an iterator of results, so any per-call setup cost is paid once per partition rather than once per element.
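As a rough sketch (reusing the searchYTLink function and searchTest list from your question, with searchYTLinks being just an illustrative name), a mapPartitions version could look like this:

def searchYTLinks(titles):
    # titles is an iterator over one partition's worth of strings;
    # yield one link per title so the result is again an iterator
    for title in titles:
        yield searchYTLink(title)

links = sc.parallelize(searchTest).mapPartitions(searchYTLinks).collect()

The point is simply that the function you pass in handles a full partition rather than a single element.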
Upvotes: 1