Reputation: 5709
I am currently trying to get a lot of data about video games out of Wikipedia using their public API. I've gotten some of the way: I can currently get all the page IDs I need with their associated article titles. But then I need to get their unique identifiers (Qxxxx, where the x's are numbers), and that takes quite a while, possibly because I have to make a single query for every title (there are 22,031 of them) or because I don't understand Wikipedia queries.
So I thought "Why not just make multiple queries at once?" and started working on that, but I've run into the issue in the title. After the program has run for a while (usually 3-4 minutes), the application crashes with the error in the title. I think it's because my approach is just bad:
ConcurrentBag<Entry> entrybag = new ConcurrentBag<Entry>(entries);
Console.WriteLine("Getting Wikibase Item Ids...");
Parallel.ForEach<Entry>(entrybag, (entry) =>
{
    entry.WikibaseItemId = GetWikibaseItemId(entry).Result;
});
Here is the method that is called:
async static Task<String> GetWikibaseItemId(Entry entry)
{
    using (var client = new HttpClient(new HttpClientHandler { AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate }))
    {
        client.BaseAddress = new Uri("https://en.wikipedia.org/w/api.php");
        entry.Title.Replace("+", "Plus");
        entry.Title.Replace("&", "and");
        String queryString = "?action=query&prop=pageprops&ppprop=wikibase_item&format=json&redirects=1&titles=" + entry.Title;
        HttpResponseMessage response = await client.GetAsync(queryString);
        response.EnsureSuccessStatusCode();
        String result = response.Content.ReadAsStringAsync().Result;
        dynamic deserialized = JsonConvert.DeserializeObject(result);
        String data = deserialized.ToString();
        try
        {
            if (data.Contains("wikibase_item"))
            {
                return deserialized["query"]["pages"]["" + entry.PageId + ""]["pageprops"]["wikibase_item"].ToString();
            }
            else
            {
                return "NONE";
            }
        }
        catch (RuntimeBinderException)
        {
            return "NULL";
        }
        catch (Exception)
        {
            return "ERROR";
        }
    }
}
And just for good measure, here is the Entry Class:
public class Entry
{
    public EntryCategory Category { get; set; }
    public int PageId { get; set; }
    public String Title { get; set; }
    public String WikibaseItemId { get; set; }
}
Could anyone perhaps help out? Do I just need to change how I query, or is it something else?
Upvotes: 1
Views: 653
Reputation: 8726
Initiating roughly 22,000 HTTP requests in parallel from one process is just too much. If your machine had unlimited resources and bandwidth, this would come close to a denial-of-service attack.
What you are seeing is either TCP/IP port exhaustion or queue contention. To resolve it, process your list in smaller chunks: fetch ten items, process those in parallel, then fetch the next ten, and so on.
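A minimal sketch of that chunked approach (the names `entries` and `GetWikibaseItemId` are taken from the question; the chunk size of ten is just the example above, and this has to run inside an async method):

```csharp
// Sketch: keep at most ten requests in flight instead of ~22,000 at once.
// Assumes the question's List<Entry> entries and GetWikibaseItemId method.
const int chunkSize = 10;
for (int i = 0; i < entries.Count; i += chunkSize)
{
    var chunk = entries.Skip(i).Take(chunkSize);

    // Start one task per entry in the chunk and wait for the whole
    // chunk to finish before moving on to the next one.
    await Task.WhenAll(chunk.Select(async entry =>
    {
        entry.WikibaseItemId = await GetWikibaseItemId(entry);
    }));
}
```

Reusing one shared HttpClient across all requests, instead of constructing a new one inside every call as the question's method does, also helps: each short-lived client can leave its sockets in TIME_WAIT and contribute to the port exhaustion you are hitting.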
Specifically, the Wikimedia sites recommend processing requests serially:
There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down. Most sysadmins reserve the right to unceremoniously block you if you do endanger the stability of their site.
If you make your requests in series rather than in parallel (i.e. wait for the one request to finish before sending a new request, such that you're never making more than one request at the same time), then you should definitely be fine.
Be sure to check their API terms of service to learn whether and how many parallel requests would be in compliance.
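If you go fully serial as the quoted guideline suggests, the loop collapses to one await per entry (again a sketch using the question's names, inside an async method):

```csharp
// Sketch: strictly one request in flight at a time, per the guideline above.
foreach (var entry in entries)
{
    entry.WikibaseItemId = await GetWikibaseItemId(entry);
}
```

Awaiting each call directly also removes the blocking `.Result` inside the original `Parallel.ForEach`, which is a deadlock and thread-starvation hazard in its own right.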
Upvotes: 1