Reputation: 7694
I need to scrape a list of URLs obtained from a Google search, using the Apify platform.
My plan is to start from a Google Search Scraper Actor task. However, I don't think it can be used to scrape anything other than the Google search results (maybe I'm wrong?). Therefore, I need to pass its output to another Actor task, e.g. a Web Scraper or a Puppeteer Scraper.
But I can't seem to find any documentation on chaining Actors. How should I proceed?
I found "How to pass data from crawler to actor", and setting an ACTOR.RUN.SUCCEEDED webhook on the Run task API endpoint of the second actor seems to work (that is, the second actor is launched).
However, I can't figure out how to pass the first actor's dataset to the second actor: since the Start URLs field is mandatory, I guess I should set it to the dataset, but the dataset link is different for each run…
Upvotes: 1
Views: 1714
Reputation: 677
You can chain multiple actor runs either via the Metamorph feature, or using Webhooks.
Metamorph allows you to run an actor and while the actor is running, "morph" it into a different actor with a custom input. The original actor will be stopped and replaced by the second one, but both will use the same storages, have the same run ID and will be displayed as a single actor run in the Apify app. You can use metamorph multiple times in a single run.
Metamorph is described in the Apify documentation.
Webhooks allow you to call an arbitrary API endpoint once an actor reaches a given status, for example SUCCEEDED. You can use this to call the Run Actor API to start another actor. You can set a custom payload for the webhook. However, at this moment, passing the output directly as the webhook payload is not supported, so you'll need to pass the ID of the key-value store or dataset where your results are stored and read them from there.
For example, to get the IDs of both the key-value store and the dataset of the original actor, you would configure a payload like this:
{
    "datasetId": {{resource.defaultDatasetId}},
    "keyValueStoreId": {{resource.defaultKeyValueStoreId}}
}
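To make the "read it from there" step concrete, here is a minimal Python sketch of how the receiving actor might turn that payload into a request for the first run's results. It assumes the rendered payload reaches the second actor as a JSON string, and that the results are fetched through the Apify API's dataset items endpoint. The helper name dataset_items_url is made up for illustration, and a private dataset would additionally need an API token, which is left out here.

```python
import json
from urllib.parse import quote

# Base URL of the Apify API (v2).
API_BASE = "https://api.apify.com/v2"


def dataset_items_url(webhook_payload: str) -> str:
    """Build the dataset items URL from a rendered webhook payload.

    Hypothetical helper: it expects the JSON produced by the payload
    template above, i.e. an object with a "datasetId" key.
    """
    payload = json.loads(webhook_payload)
    dataset_id = payload["datasetId"]
    # Escape the ID so it is always safe to place in a URL path.
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/items?format=json"


if __name__ == "__main__":
    # Example IDs are placeholders, not real storages.
    rendered = '{"datasetId": "abc123", "keyValueStoreId": "kv456"}'
    print(dataset_items_url(rendered))
```

Because the dataset ID is read from the payload at run time, the fact that each run gets a different dataset no longer matters.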
The task is not trivial because the Google Search output format is not compatible with the Web Scraper input format. The best way to do this is to create an intermediary actor that uses the output from Google Search Scraper to produce an input for Web Scraper and then metamorph into it. So the final flow is:
Google Search Scraper --webhook-->
Output Processor Actor --metamorph-->
Web Scraper.
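The conversion done by the intermediary actor could look roughly like the Python sketch below. It assumes each Google Search Scraper dataset item is one results page with an "organicResults" list whose entries carry a "url" field, and that Web Scraper accepts Start URLs as a list of {"url": ...} objects. Check both shapes against the actors' current input and output schemas. The function name to_start_urls is hypothetical.

```python
from typing import Any


def to_start_urls(search_pages: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Flatten search result pages into a de-duplicated Start URLs list.

    Assumed input shape: [{"organicResults": [{"url": "..."}, ...]}, ...]
    Assumed output shape: [{"url": "..."}, ...]
    """
    seen: set[str] = set()
    start_urls: list[dict[str, str]] = []
    for page in search_pages:
        for result in page.get("organicResults", []):
            url = result.get("url")
            # Skip entries without a URL and URLs already collected.
            if url and url not in seen:
                seen.add(url)
                start_urls.append({"url": url})
    return start_urls


if __name__ == "__main__":
    # Made-up sample data in the assumed output format.
    pages = [
        {"organicResults": [{"url": "https://example.com/a"},
                            {"url": "https://example.com/b"}]},
        {"organicResults": [{"url": "https://example.com/a"}]},
    ]
    print(to_start_urls(pages))
```

The resulting list would then be placed in the Start URLs field of the input that the intermediary actor passes to Web Scraper when it metamorphs.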
Upvotes: 3