Djibril Diakhate

Reputation: 1

Integrating a Scrapfly Scraper into an Azure Synapse Analytics Project

I am currently working on a project called “Azure-Social-Media-Analytics-Solution-Accelerator”. This project utilizes data from two main sources: news sites and Twitter. The data is collected and processed by Jupyter notebooks, which are run by Azure Synapse Analytics through a pipeline.

At present, the project uses the Twitter API to fetch data from Twitter. However, I wish to replace the Twitter API with a Scrapfly scraper that I have developed. My scraper uses the Scrapfly API to scrape data from Twitter and is defined in two Python files.

I have tried to update the Ingest Process file notebook in my project to use the Scrapfly scraper instead of the Twitter API. However, I am unsure of the best way to proceed.

Does anyone have experience with integrating a custom scraper into an Azure Synapse Analytics project? Any help or advice would be greatly appreciated.

Thank you in advance for your help!

Upvotes: 0

Views: 137

Answers (1)

SiddheshDesai

Reputation: 8187

  1. One option is to call the Scrapfly API from an HTTP-triggered Azure Function App and then invoke that function from your Synapse pipeline:

Azure Functions HTTP trigger:

import logging
import azure.functions as func
from scrapfly import ScrapeConfig, ScrapflyClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # Placeholder key; in practice, read it from an application setting instead of hard-coding it.
    scrapfly = ScrapflyClient(key='scp-live-390c9xxxxxx54ce36')
    # Scrape the target URL through Scrapfly (replace the test URL with the page you need).
    api_response = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
    logging.info(f'Scraping Result: {api_response.success}')
    # Pull out the scraped content and the response metadata.
    scrape_result = api_response.scrape_result
    content = scrape_result['content']
    context = api_response.context
    status_code = api_response.status_code
    upstream_status_code = api_response.upstream_status_code

    # Return everything to the caller (for example, the Synapse pipeline) as plain text.
    return func.HttpResponse(
        f"Scrape Result: {api_response.success}. Content: {content}. Context: {context}. Status Code: {status_code}. Upstream Status Code: {upstream_status_code}",
        status_code=200
    )
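
To try the function from a Synapse notebook before wiring it into the pipeline, something like the following could work. This is only a sketch: the app name, function name, and key are hypothetical placeholders, and it assumes the function uses the default `https://<app>.azurewebsites.net/api/<function>?code=<key>` invoke URL with function-level authorization.

```python
import urllib.parse
import urllib.request


def build_function_url(app_name: str, function_name: str, function_key: str) -> str:
    """Build the invoke URL of an HTTP-triggered Azure Function.

    All three arguments are placeholders for your own deployment.
    """
    # Function keys often contain characters such as '=' that must be URL-encoded.
    query = urllib.parse.urlencode({"code": function_key})
    return f"https://{app_name}.azurewebsites.net/api/{function_name}?{query}"


def call_scraper_function(url: str, timeout: float = 60.0) -> str:
    """Send a GET request to the function and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical names; replace them with the values from your Function App.
url = build_function_url("my-function-app", "ScrapflyTrigger", "my-function-key==")
# body = call_scraper_function(url)  # uncomment once the function is deployed
```

The same URL can be pasted into a Synapse Web activity with the GET method, so the notebook call is mainly useful for checking the response before building the pipeline step.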


  2. You can also create a Logic App workflow that calls the Scrapfly API, as shown below, and then call the Logic App URL from an Azure Synapse Web activity:

Scrapfly API request:

{
    "url": "https://api.scrapfly.io/scrape",
    "method": "GET",
    "headers": {
        "Content-Type": "application/json"
    },
    "body": {
        "url": "https://twitter.com",
        "key": "xxxxxxxxx5fc8fa67dso",
        "proxy_pool": "public_datacenter_pool",
        "headers": {
            "content-type": "application/json",
            "Cookie": "test=1;auth=1"
        },
        "country": "us",
        "lang": "en",
        "os": "win11",
        "timeout": 30000
    }
}
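
For comparison, the same request can be expressed as a plain URL. The sketch below assumes that the Scrapfly scrape endpoint reads its settings (`key`, `url`, `country`, and so on) from the query string of a GET request; the helper name `build_scrapfly_url` is hypothetical and the key is a placeholder.

```python
import urllib.parse

SCRAPFLY_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_scrapfly_url(api_key: str, target_url: str, **options: str) -> str:
    """Build a scrape request URL with the settings passed as query parameters.

    Assumption: the endpoint accepts these settings in the query string.
    """
    params = {"key": api_key, "url": target_url}
    params.update(options)  # e.g. country="us", proxy_pool="public_datacenter_pool"
    # urlencode percent-encodes the target URL so it survives as a single parameter.
    return f"{SCRAPFLY_ENDPOINT}?{urllib.parse.urlencode(params)}"


# Placeholder key; the options mirror some of the fields in the JSON above.
request_url = build_scrapfly_url(
    "YOUR_SCRAPFLY_KEY",
    "https://twitter.com",
    country="us",
    proxy_pool="public_datacenter_pool",
)
print(request_url)
```

A URL built this way could be used in the Logic App HTTP action or tested directly from a notebook.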


Then invoke this Logic App from an Azure Synapse Web activity, in the same way that the Microsoft documentation invokes a Logic App from a Web activity to send an email.

Upvotes: 0
