Reputation: 1
I am currently working on a project called “Azure-Social-Media-Analytics-Solution-Accelerator”. This project utilizes data from two main sources: news sites and Twitter. The data is collected and processed by Jupyter notebooks, which are run by Azure Synapse Analytics through a pipeline.
At present, the project uses the Twitter API to fetch data from Twitter. However, I wish to replace the Twitter API with a Scrapfly scraper that I have developed. My scraper uses the Scrapfly API to scrape data from Twitter and is defined in two Python files.
I have tried to update the Ingest Process notebook in my project to use the Scrapfly scraper instead of the Twitter API, but I am unsure of the best way to proceed.
Does anyone have experience with integrating a custom scraper into an Azure Synapse Analytics project? Any help or advice would be greatly appreciated.
Thank you in advance for your help!
Upvotes: 0
Views: 137
Reputation: 8187
Azure Functions HTTP trigger that calls Scrapfly:
import logging

import azure.functions as func
from scrapfly import ScrapeConfig, ScrapflyClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # Create the Scrapfly client (API key redacted) and scrape a test URL.
    scrapfly = ScrapflyClient(key='scp-live-390c9xxxxxx54ce36')
    api_response = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
    logging.info(f'Scraping Result: {api_response.success}')

    # Pull the scraped content and the response metadata out of the API response.
    scrape_result = api_response.scrape_result
    content = scrape_result['content']
    context = api_response.context
    status_code = api_response.status_code
    upstream_status_code = api_response.upstream_status_code

    # Return everything as plain text in the HTTP response.
    return func.HttpResponse(
        f"Scrape Result: {api_response.success}. Content: {content}. Context: {context}. Status Code: {status_code}. Upstream Status Code: {upstream_status_code}",
        status_code=200
    )
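To connect this back to the ingest notebook, here is a minimal sketch of how a notebook cell might call the function over HTTP. Everything named here is an assumption for illustration: the helper names `build_function_url` and `call_scraper`, the example endpoint, and the `url` query parameter (the function above hardcodes its target, so it would need to read that parameter itself). The `code` parameter is the usual way to pass an Azure Functions key in the query string. Only the standard library is used.

```python
import urllib.parse
import urllib.request


def build_function_url(base_url: str, target_url: str, code: str = "") -> str:
    """Build the function URL with the page to scrape as a query parameter.

    `url` is a hypothetical parameter the function would have to read;
    `code` carries the function key when the trigger is not anonymous.
    """
    params = {"url": target_url}
    if code:
        params["code"] = code
    return f"{base_url}?{urllib.parse.urlencode(params)}"


def call_scraper(function_url: str, opener=urllib.request.urlopen, timeout: int = 60) -> str:
    """Send a GET request to the function and return the response body as text.

    `opener` is injectable so the call can be exercised without network access.
    """
    with opener(function_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

In the notebook, the returned text would then be parsed and written to storage in place of the Twitter API results; having the function return JSON rather than a formatted string would make that step simpler.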
Logic App HTTP action calling the Scrapfly API:
{
    "url": "https://api.scrapfly.io/scrape",
    "method": "GET",
    "headers": {
        "Content-Type": "application/json"
    },
    "body": {
        "url": "https://twitter.com",
        "key": "xxxxxxxxx5fc8fa67dso",
        "proxy_pool": "public_datacenter_pool",
        "headers": {
            "content-type": "application/json",
            "Cookie": "test=1;auth=1"
        },
        "country": "us",
        "lang": "en",
        "os": "win11",
        "timeout": 30000
    }
}
Then invoke this Logic App from an Azure Synapse Web activity, similar to this MS document, which uses a Web activity to send an email through a Logic App.
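As a rough sketch of that last step, the snippet below builds a Web activity definition as a Python dict and prints it as JSON for the pipeline. It assumes the Logic App starts with an HTTP request trigger that accepts POST; the helper name `build_web_activity`, the activity name, and the placeholder URL are hypothetical. The `WebActivity` type and the `typeProperties` fields reflect my understanding of the Synapse pipeline JSON format, so check them against the pipeline's code view.

```python
import json


def build_web_activity(name: str, logic_app_url: str, body: dict) -> dict:
    """Return a pipeline Web activity definition that POSTs `body` to the Logic App."""
    return {
        "name": name,
        "type": "WebActivity",
        "typeProperties": {
            "url": logic_app_url,  # callback URL of the Logic App HTTP trigger
            "method": "POST",
            "headers": {"Content-Type": "application/json"},
            "body": body,
        },
    }


if __name__ == "__main__":
    activity = build_web_activity(
        name="InvokeScrapflyLogicApp",  # hypothetical activity name
        logic_app_url="https://<your-logic-app-trigger-url>",  # placeholder, not a real endpoint
        body={"url": "https://twitter.com"},
    )
    print(json.dumps(activity, indent=2))
```

A later activity can read the Logic App response from this activity's output, or the Logic App can write the scraped data to storage for the notebook to pick up.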
Upvotes: 0