Reputation: 1697
I'm writing an application in C# that cycles through the articles of a local database copy of Wikipedia. I use a bunch of regexes to extract the right information from each article, launch a thread to fetch an image for the article, save the information, and move on to the next article.
I need to use a list of proxies to download these images so I don't get banned by Google. Since proxies can be slow, I use threads to make the downloads parallel.
If I don't use threads, the application works correctly, but it takes a long time to gather all the information.
If I use threads, the application works until it reaches around 500 threads, and then I get an OutOfMemoryException.
The thing is, it only uses ~300 MB of RAM, so it is nowhere near either the total memory available (8 GB) or the memory limit of a single 32-bit application.
Is there a limit on the number of threads per application?
EDIT:
Here is the code to download the poster (started with getPosterAsc()).
string ddlValue = "";

private void tryDownload(object obj)
{
    WebClient webClientProxy = new WebClient();
    Tuple<WebProxy, int> proxy = (Tuple<WebProxy, int>)((object[])obj)[0];
    if (proxy != null)
        webClientProxy.Proxy = proxy.Item1;
    try
    {
        ddlValue = webClientProxy.DownloadString((string)((object[])obj)[1]);
    }
    catch (Exception ex)
    {
        ddlValue = "";
        Console.WriteLine("trydownload: " + ex.Message);
    }
    webClientProxy.Dispose();
}
public void getPoster(object options = null)
{
    if (options == null)
        options = new object[2] { toSave, false };
    if (!AppVar.debugMode && AppVar.getImages && this.getImage)
    {
        if (this.original_name != "" && !this.ambName && this.suitable)
        {
            Log.CountImgInc();
            MatchCollection MatchList;
            string basic_options = "";
            string value = "";
            WebClient webClient = new WebClient();
            Regex reg;
            bool found = false;
            if (original_name.Split(' ').Length > 1)
                image_options = "";
            if (!found)
            {
                bool succes = false;
                int countTry = 0;
                while (!succes)
                {
                    Tuple<WebProxy, int> proxy = null;
                    if (countTry != 5)
                        proxy = Proxy.getProxy();
                    try
                    {
                        Thread t = new Thread(tryDownload);
                        if (!(bool)((object[])options)[1])
                            t.Start(new object[] { proxy, @"http://www.google.com/search?as_st=y&tbm=isch&as_q=" + image_options + "+" + basic_options + "+" + image_options_before + "%22" + simplify(original_name) + "%22+" + " OR %22" + original_name + "%22+" + image_options_after + this.image_format });
                        else
                            t.Start(new object[] { proxy, @"http://www.google.com/search?as_st=y&tbm=isch&as_q=" + image_options + "+" + basic_options + "+" + image_options_before + "%22" + simplify(original_name) + "%22+" + " OR %22" + original_name + "%22+" + image_options_after + "&biw=1218&bih=927&tbs=isz:ex,iszw:758,iszh:140,ift:jpg&tbm=isch&source=lnt&sa=X&ei=kuG7T6qaOYKr-gafsOHNCg&ved=0CIwBEKcFKAE" });
                        if (!t.Join(40000))
                        {
                            Proxy.badProxy(proxy.Item1.Address.Host, proxy.Item1.Address.Port);
                            continue;
                        }
                        else
                        {
                            value = ddlValue;
                            if (value != "")
                                succes = true;
                            else
                                Proxy.badProxy(proxy.Item1.Address.Host, proxy.Item1.Address.Port);
                        }
                    }
                    catch (Exception ex)
                    {
                        if (proxy != null)
                            Proxy.badProxy(proxy.Item1.Address.Host, proxy.Item1.Address.Port);
                    }
                    countTry++;
                }
                reg = new Regex(@"imgurl\=(.*?)&imgrefurl", RegexOptions.IgnoreCase);
                MatchList = reg.Matches(value);
                if (MatchList.Count > 0)
                {
                    bool foundgg = false;
                    int j = 0;
                    while (!foundgg && MatchList.Count > j)
                    {
                        if (MatchList[j].Groups[1].Value.Substring(MatchList[j].Groups[1].Value.Length - 3, 3) == "jpg")
                        {
                            try
                            {
                                string guid = Guid.NewGuid().ToString();
                                webClient.DownloadFile(MatchList[j].Groups[1].Value, @"c:\temp\" + guid + ".jpg");
                                FileInfo fi = new FileInfo(@"c:\temp\" + guid + ".jpg");
                                this.image_size = fi.Length;
                                using (Image img = Image.FromFile(@"c:\temp\" + guid + ".jpg"))
                                {
                                    int minHeight = this.cov_min_height;
                                    if ((bool)((object[])options)[1])
                                        minHeight = 100;
                                    if (img.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Jpeg) && img.HorizontalResolution > 70 && img.Size.Height > minHeight && img.Size.Width > this.cov_min_width && this.image_size < 250000)
                                    {
                                        foundgg = true;
                                        image_name = guid;
                                        image_height = img.Height;
                                        image_width = img.Width;
                                        img.Dispose();
                                        if ((bool)((object[])options)[0])
                                        {
                                            Mediatly.savePoster(this, (bool)((object[])options)[1]);
                                        }
                                    }
                                    else
                                    {
                                        img.Dispose();
                                        File.Delete(@"c:\temp\" + guid + ".jpg");
                                    }
                                }
                            }
                            catch (Exception ex)
                            {
                            }
                        }
                        j++;
                    }
                }
            }
            webClient.Dispose();
            Log.CountImgDec();
        }
    }
}
public void getPosterAsc(bool save = false, bool banner = false)
{
    ThreadPool.QueueUserWorkItem(new WaitCallback(getPoster), new object[2] { save, banner });
}
Upvotes: 4
Views: 7240
Reputation: 2136
Using perfmon, check what is actually using the memory; in particular, pay close attention to the 'Modified Page List Bytes' value. This can be particularly troublesome in multithreaded applications where a reference to a file is kept for a particular length of time - the usual (temporary) resolution for high utilisation of this value is to increase the available virtual memory.
Also, if you are running highly threaded applications on Windows Server 2008, you will need to apply the Dynamic Cache Service (DynCache) from Microsoft to prevent the system file cache from effectively eating your available memory.
Both of the issues above can be directly related to .NET multithreaded applications processing large amounts of data. Unfortunately, they don't show up as memory used by your application and as a result can be hard to track down (as I found out over the course of a painful few days).
Upvotes: 1
Reputation: 2683
I recently ran into a problem in one of my applications that looked very similar to this. It had to do with the amount of data being stored and used in a single string object. If I had to guess, your OutOfMemoryException is coming from the initial assignment of
ddlValue = webClientProxy.DownloadString((string)((object[])obj)[1]);
If you can rewrite it to do so, find a way to access the web response as a stream instead of reading the entire response into a string. You can then parse the response line by line using a StreamReader.
Yes, I know this sounds very complicated, but it matches the solution I ended up having to use in my own code. I was dealing with pieces of data that were too large to store as a single string and had to access them directly from the stream instead.
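A minimal sketch of what I mean, using HttpWebRequest instead of WebClient (the URL and the imgurl= filter are illustrative; plug in your own query and proxy as before):

```csharp
using System;
using System.IO;
using System.Net;

class StreamScan
{
    // Scan the response line by line; only one line is held in memory
    // at a time, instead of the whole document in a single string.
    public static int CountImageLines(TextReader reader)
    {
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
            if (line.Contains("imgurl="))
                count++;
        return count;
    }

    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://www.google.com/search?q=example");
        // request.Proxy = proxy.Item1; // attach the rotating proxy here, as in the original code
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            Console.WriteLine(CountImageLines(reader));
    }
}
```

The key point is that GetResponseStream never buffers the full body the way DownloadString does, so a huge response can't blow up a single allocation.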
Upvotes: 0
Reputation: 4330
When you use a 32-bit executable you can actually allocate only 2 GB by default, not 8 GB (see here for more information: http://blogs.msdn.com/b/tom/archive/2008/04/10/chat-question-memory-limits-for-32-bit-and-64-bit-processes.aspx).
Try limiting your worker threads so you won't use that many, and make sure the code executed on the threads doesn't leak memory.
Wrap your thread execution with try...catch (if you get the OutOfMemoryException in the thread's code), because it might be related to the images you download.
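Something like this sketch (the delegate and messages are just placeholders): a wrapper that catches OutOfMemoryException separately, so you can at least see whether the allocation failure happens inside the workers rather than on the main thread.

```csharp
using System;
using System.Threading;

class GuardedWorker
{
    // Wraps a thread body so no exception escapes unobserved.
    public static void Run(Action body)
    {
        try
        {
            body();
        }
        catch (OutOfMemoryException ex)
        {
            // The interesting case for this question: log it instead of dying.
            Console.WriteLine("OOM in worker: " + ex.Message);
        }
        catch (Exception ex)
        {
            Console.WriteLine("worker failed: " + ex.Message);
        }
    }

    static void Main()
    {
        var t = new Thread(() => Run(() => { throw new Exception("boom"); }));
        t.Start();
        t.Join(); // prints "worker failed: boom" instead of crashing
    }
}
```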
Upvotes: 0
Reputation: 23833
I would make sure that you are using the thread pool to 'manage' your threads. As someone has said, each thread consumes around 1 MB of memory, and depending on system hardware this could be causing your problem.
One potential way to approach this issue is to use the thread pool. This cuts the overhead incurred by spawning all your threads, by sharing and recycling threads where possible. It still allows a low-level threading facility (with many threads active) but limits the performance penalty of doing so.
The thread pool also keeps a limit on the number of worker threads (note: these will all be background threads) it will run simultaneously. Too many active threads are a large administrative overhead and can render the CPU cache ineffective. Once the thread pool limit you impose is reached, additional jobs are queued and execute when another worker thread becomes free. This, I feel, is a much more effective, safer and more resource-efficient way of doing what you require.
Depending on your current code there are a number of ways to enter the thread pool:
BackgroundWorker
ThreadPool.QueueUserWorkItem
The Task Parallel Library (TPL)
Personally I would use the TPL as it is awesome! I hope this helps.
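To make the idea concrete, here is a minimal sketch (not the poster code itself; the limit of 10 and the Sleep stand-in for the download are arbitrary values I chose) of queueing work on the thread pool while capping how many downloads run at once with a SemaphoreSlim:

```csharp
using System;
using System.Threading;

class ThrottledDownloads
{
    // At most 10 downloads in flight; the rest wait at the gate
    // instead of each getting a dedicated ~1 MB thread stack.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(10);

    public static int RunDemo(int jobs)
    {
        int completed = 0;
        using (var done = new CountdownEvent(jobs))
        {
            for (int i = 0; i < jobs; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    Gate.Wait();
                    try
                    {
                        Thread.Sleep(10); // stand-in for webClientProxy.DownloadString(...)
                        Interlocked.Increment(ref completed);
                    }
                    finally
                    {
                        Gate.Release();
                        done.Signal();
                    }
                });
            }
            done.Wait(); // block until every queued job has finished
        }
        return completed;
    }

    static void Main()
    {
        Console.WriteLine(RunDemo(100)); // prints 100
    }
}
```

With this shape, queueing 500 (or 5000) work items is cheap: they are just entries in the pool's queue, not 500 live threads each reserving stack space.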
Upvotes: 3