Reputation: 360
I am trying to read all AAD-related questions and answers from the Stack Exchange API, /2.2/search/advanced/pagesize=100&fromdate=2019-07-01&todate=2020-10-19&site=stackoverflow&filter=!BLIw93LDFyFBUjlepdSTkMo7r6Pkpx&q=listOfTags
by passing a set of tags. We are trying to get the data from July 1st, 2019 onward.
Our ADF pipeline keeps getting throttled even though we set the wait time to 1 minute, so our ETL is very slow and seems to run forever.
Current Approach (very slow)
I am using ADF to pull all the questions that match the tags (iterating page by page with an Until activity) and load the data into SQL.
Pass the question id to this API https://api.stackexchange.com/docs/answers-on-questions#order=desc&sort=activity&ids=29433422&filter=!0U7YRMKgNJq(Exonzn(PdiZE5&site=stackoverflow&run=true to get all the answers for the respective question, and then load the result into SQL.
Questions:
Is there a direct back end (Kusto, SQL, Cosmos, etc.) we can get the data from, rather than calling the API for the questions and answers? If so, how do we get access to it?
What is an efficient approach to pull historical data from Stack Overflow without being throttled?
Upvotes: 0
Views: 162
Reputation: 5520
You are being throttled because you have probably made 300 requests (the maximum number of calls without a key) or because the URL is invalid. FWIW, registering your application on StackApps increases your API quota from 300 to 10,000! You can then pass the key as a parameter: &key=...
Now, regarding the URL:
You are using .../advanced/pagesize=100... It should be /advanced?pagesize=100&param=value...
You are passing the dates as YYYY-MM-DD. They should be in Unix epoch time! In your case, fromdate should be 1561939200 and todate 1603065600. (Note: if you want to fetch results until today, then you can omit todate.)
I'm not sure I understand what you're trying to do. However, if the API is suitable for your task, then you don't need such a big delay. It should probably be under 1 second. What you should do is check whether the backoff field exists in the API response. If it does, then wait that many seconds before proceeding.
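To make the points above concrete, here is a minimal Python sketch (not ADF code) of how the corrected URL and the backoff check might look. The helper names to_epoch, build_search_url and wait_for_backoff are made up for illustration, dates are assumed to mean midnight UTC, and the filter parameter from the question is left out for brevity.

```python
import time
from datetime import datetime, timezone
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.2"


def to_epoch(date_str):
    """Convert YYYY-MM-DD (assumed to be midnight UTC) to Unix epoch seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_search_url(query, fromdate, todate=None, page=1, key=None):
    """Build a /search/advanced URL with a '?' before the query string."""
    params = {
        "page": page,
        "pagesize": 100,
        "fromdate": to_epoch(fromdate),
        "site": "stackoverflow",
        "q": query,
    }
    if todate is not None:  # omit todate to fetch results until today
        params["todate"] = to_epoch(todate)
    if key is not None:  # key obtained by registering on StackApps
        params["key"] = key
    return f"{API_ROOT}/search/advanced?{urlencode(params)}"


def wait_for_backoff(response_json, sleep=time.sleep):
    """Sleep only if the parsed response carries a backoff field."""
    seconds = response_json.get("backoff", 0)
    if seconds:
        sleep(seconds)
    return seconds
```

In a paging loop you would call wait_for_backoff on each parsed response before requesting the next page, instead of waiting a fixed minute every time.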
With regard to your first question... how about SEDE (the Stack Exchange Data Explorer)? You can run SQL queries for any site you want there and get the results in CSV format. Here is the help page and you can find the public schema in this Meta Stack Exchange question. If you encounter difficulties, feel free to ask a new question.
Upvotes: 0
Reputation: 1796
#1) I doubt there is anything like that. #2) Throttling happens per client IP, so you could try deploying the same ADF pipeline in a different region; that may help. If you go that route, you will have to update the API call with a date filter so that no two regions query the same set of data.
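If you try the multi-region route, the date filter could be derived by cutting the overall range into contiguous windows, one per region. A rough Python sketch follows; split_range is a hypothetical helper, dates are assumed to be midnight UTC, and whether the API treats the window boundaries as inclusive is not checked here, so it is safer to deduplicate on question id when loading into SQL.

```python
from datetime import datetime, timezone


def _epoch(date_str):
    """YYYY-MM-DD (assumed midnight UTC) to Unix epoch seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def split_range(start, end, parts):
    """Split start..end into `parts` contiguous (fromdate, todate) windows."""
    start_ts, end_ts = _epoch(start), _epoch(end)
    step = (end_ts - start_ts) // parts
    windows = []
    for i in range(parts):
        lo = start_ts + i * step
        # The last window absorbs any remainder from the integer division.
        hi = end_ts if i == parts - 1 else lo + step
        windows.append((lo, hi))
    return windows


# One (fromdate, todate) pair per region, e.g. for three regions:
windows = split_range("2019-07-01", "2020-10-19", 3)
```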
Upvotes: 0