Reputation: 135
Before I start, I would like advice from the community on the best and most efficient way to go about this.
Here is what I want to do:
The data is updated twice a day, so I would pull it from the source APIs and run it through my pipeline twice a day.
So basically, I want to create an API by aggregating data from multiple source APIs.
I've started playing with Lambda and created the following function using Python.
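    # https://stackoverflow.com/a/41765656
    import json

    import requests


    def lambda_handler(event, context):
        # https://www.nylas.com/blog/use-python-requests-module-rest-apis/ USEFUL!!!
        # https://stackoverflow.com/a/65896274
        # Time out instead of hanging until the Lambda itself is killed.
        response = requests.get("https://remoteok.com/api", timeout=10)
        response.raise_for_status()  # fail loudly on a 4xx/5xx from the source API
        # print(response.json())
        return {
            'statusCode': 200,
            # Serialized to a string, which API Gateway proxy integrations expect.
            'body': json.dumps(response.json())
        }

    # https://stackoverflow.com/questions/63733410/using-lambda-to-add-json-to-dynamodb DYNAMODB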
This works and returns a JSON response.
Here are my questions:
However, before I go further, I would like to know the best way to go about this.
If you have experience with this, I would love to hear from you.
Upvotes: 1
Views: 589
Reputation: 31
The answer depends on how much data you plan to pull. Lambda is designed for short-lived invocations (a single invocation can run for at most 15 minutes), so it might not be suitable for large or long-running jobs.
I would recommend looking into AWS Glue, as this looks like a fairly typical ETL (Extract, Transform, Load) problem. You can set up Glue jobs to run on a schedule, and the data aggregation is the T in ETL.
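To make the two ideas above concrete, here is a rough sketch rather than a definitive implementation. `merge_records` shows one possible shape for the aggregation step in plain Python, and `twice_daily_trigger_kwargs` builds the arguments for a scheduled Glue trigger that fires twice a day. The function names, the `id` key, the trigger and job names, and the 06:00/18:00 UTC hours are all assumptions chosen for illustration.

```python
from typing import Any, Dict, Iterable, List, Sequence


def merge_records(sources: Dict[str, Iterable[Dict[str, Any]]],
                  key: str = "id") -> List[Dict[str, Any]]:
    """Combine records from several source APIs into one de-duplicated list."""
    merged: Dict[Any, Dict[str, Any]] = {}
    for source_name, records in sources.items():
        for record in records:
            if key not in record:
                continue  # skip records without a usable identifier
            # IDs from different APIs can collide, so namespace them by source.
            # setdefault keeps the first record seen for each (source, id) pair.
            merged.setdefault((source_name, record[key]),
                              {**record, "source": source_name})
    return list(merged.values())


def twice_daily_trigger_kwargs(trigger_name: str, job_name: str,
                               hours: Sequence[int] = (6, 18)) -> Dict[str, Any]:
    """Build keyword arguments for a scheduled Glue trigger (hours are UTC)."""
    hour_list = ",".join(str(hour) for hour in hours)
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # AWS cron fields: minute hour day-of-month month day-of-week year
        "Schedule": f"cron(0 {hour_list} * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }
```

With boto3 installed and credentials configured, the trigger could then be created with something like `boto3.client("glue").create_trigger(**twice_daily_trigger_kwargs("twice-daily", "aggregate-jobs"))`, where both names are placeholders for your own resources.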
It's simple to write the Glue DataFrame (the result of a transformation) out as S3 files, which can then be queried directly with Amazon Athena, as if they were database tables.
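As a sketch of what querying those S3 files through Athena might look like from Python, the helper below starts a query, polls until it finishes, and turns the first page of results into dictionaries. It expects a client such as `boto3.client("athena")` to be passed in. The helper name, the database name, and the S3 output location are assumptions, and a table over the S3 files is assumed to exist already.

```python
import time
from typing import Any, Dict, List, Optional


def run_athena_query(athena: Any, sql: str, database: str,
                     output_location: str,
                     poll_seconds: float = 1.0) -> List[Dict[str, Optional[str]]]:
    """Run a query on Athena and return the first page of rows as dicts."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Athena query {status['State']}: {reason}")

    # Only the first page is read here; large results need pagination.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    if not rows:
        return []
    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT * FROM listings LIMIT 10", "jobs_db", "s3://example-bucket/athena-results/")`, with the table, database, and bucket names standing in for your own.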
As for exposing that data via an API, the Serverless Framework and SST are great tools for taking the sting out of spinning up a serverless API and its associated resources.
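Whichever tool deploys it, the API side usually comes down to a small Lambda handler behind API Gateway. The following is a minimal sketch, assuming a proxy integration and an optional `source` query parameter. `load_listings` is a hypothetical placeholder with hard-coded sample data; a real handler would read the aggregated data from Athena or another store.

```python
import json
from typing import Any, Dict, List


def load_listings() -> List[Dict[str, Any]]:
    """Hypothetical placeholder; swap in a real lookup of the aggregated data."""
    return [
        {"id": 1, "position": "Backend Engineer", "source": "remoteok"},
        {"id": 2, "position": "Data Engineer", "source": "example-board"},
    ]


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # API Gateway sends None here when the request has no query string.
    params = event.get("queryStringParameters") or {}
    source = params.get("source")

    listings = load_listings()
    if source:
        listings = [item for item in listings if item.get("source") == source]

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # Proxy integrations expect the body to be a string, not a dict or list.
        "body": json.dumps(listings),
    }
```

Keeping the data access in its own function means the handler can be exercised locally with a plain dict as the event before any infrastructure exists.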
Upvotes: 2