Masood Ahmad

Reputation: 741

Multiple threads using the same value fetched by a database DAO class method

Status: solved

I had to use pastebin because I need to point to specific line numbers.

Note: I am not using an ExecutorService or thread pools. I just want to understand what is wrong with starting and using threads this way. If I use 1 thread, the app works perfectly!

related links:

http://www.postgresql.org/docs/9.1/static/transaction-iso.html
http://www.postgresql.org/docs/current/static/explicit-locking.html

Main app: http://pastebin.com/i9rVyari
Logs: http://pastebin.com/2c4pU1K8 and http://pastebin.com/2S3301gD

I am starting many threads (10) in a for loop, each one instantiating a Runnable class, but it seems every thread gets the same result from the DB. Each thread fetches a string from the DB and then changes it, yet every thread gets the same string, even though another thread has already changed it. I am using JDBC with PostgreSQL. What are the usual issues here?

See line 252 and line 223.

The link is marked as processed (processed = true) in the DB. Other threads of the Crawler class do this too. So when line 252 fetches a link, it should get one with processed = false, but I see all threads take the same link.

When one of the threads has crawled a link, it sets processed = true. The others should then not crawl it (or even fetch it) once it is marked processed = true.


getNonProcessedLinkFromDB() returns a non-processed link.

public String getNonProcessedLink() is at line 645
public boolean markLinkAsProcesed(String link) is at line 705

getNonProcessedLinkFromDB looks for links with processed = false and returns one of them (LIMIT 1). Each thread starts 20 seconds after the previous one. Within one thread, crawling a link takes an estimated 1 or 2 seconds.

Line 98 keeps threads from grabbing the same URL.

If you look at the result, one thread set the link to true, yet the others still access it, long afterwards.

All threads are separate. Even if one races, the DB marks the link true the moment the first thread processes it.

Upvotes: 0

Views: 650

Answers (2)

Masood Ahmad

Reputation: 741

The comments and responses from the helpers in this post were also correct.

Adding this at the start of the crawl() method body:

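    // Fetch and mark under one class-wide lock, so no other Crawler thread
    // can fetch the same link between the two calls.
    synchronized(Crawler.class){
        url = getNonProcessedLinkFromDB();
        new BasicDAO().markLinkAsProcesed(url);
    }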

and at the bottom of the crawl() method body (once it has finished processing):

    crawl(nonProcessedLinkFromDB);

actually solved the issue.

The problem was the gap between fetching a link and marking it as processed: other threads could fetch the same link while the current thread was still working on it.

The synchronized block helped further.
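To illustrate why fetching and marking under a single lock closes that gap, here is a minimal in-memory sketch. It does not touch the database: LinkStore, claimNextLink and ClaimDemo are hypothetical names that stand in for the links table and the DAO calls, so treat it as a model of the idea rather than the actual fix in the pastebin code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory stand-in for the links table: url -> processed flag.
class LinkStore {
    private final Map<String, Boolean> processed = new LinkedHashMap<>();

    LinkStore(String... urls) {
        for (String url : urls) {
            processed.put(url, false);
        }
    }

    // Fetch and mark happen under one lock, so no other thread can
    // read the same link between the two steps.
    synchronized String claimNextLink() {
        for (Map.Entry<String, Boolean> entry : processed.entrySet()) {
            if (!entry.getValue()) {
                entry.setValue(true);
                return entry.getKey();
            }
        }
        return null; // no unprocessed links left
    }
}

class ClaimDemo {
    // Runs threadCount workers against one store and returns how many
    // times a link was handed out to more than one worker.
    static int countDuplicateClaims(int threadCount, String... urls) {
        LinkStore store = new LinkStore(urls);
        Set<String> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger duplicates = new AtomicInteger();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                String url;
                while ((url = store.claimNextLink()) != null) {
                    if (!seen.add(url)) {
                        duplicates.incrementAndGet();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return duplicates.get();
    }
}
```

If the fetch and the mark were two separate unsynchronized calls, two workers could both see the same unprocessed entry before either one marked it, which matches the behaviour described in the question.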

Thanks to the helper "Fuber" on IRC (Quakenet #java and Freenode ##javaee)

and to ALL who supported me!

Upvotes: 0

pimaster

Reputation: 1967

This is a case of a question that is not concise. There is a lot of code in there and you have no idea what is going on. You need to break it down so that you can understand where it is going wrong, then show us that bit.

Some potential sources of conflict:

  • You are opening a database connection for almost every process. The normal flow of an application is to open a few connections, do some processing, then close them.
  • Are you handling database commits? I don't remember what the default setting is for a Postgres database, you'll have to look into it.
  • A single URL can be in one of 3 states: unprocessed, being processed, processed. I don't think you are handling the 'being processed' state at all. Because being processed takes time and may fail, you have to account for those situations.
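As a rough sketch of the three-state idea from the last point, something like the following could track the state per URL in memory. LinkState and LinkTracker are names made up for this example; in the real application the state would presumably live in a column in the database rather than in a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical states for a single URL.
enum LinkState { UNPROCESSED, IN_PROGRESS, PROCESSED }

// Hypothetical tracker; a real crawler would keep this in the database.
class LinkTracker {
    private final Map<String, LinkState> states = new ConcurrentHashMap<>();

    void add(String url) {
        states.putIfAbsent(url, LinkState.UNPROCESSED);
    }

    // Atomically moves UNPROCESSED -> IN_PROGRESS. Only one caller can win.
    boolean tryClaim(String url) {
        return states.replace(url, LinkState.UNPROCESSED, LinkState.IN_PROGRESS);
    }

    // Called when crawling succeeded.
    void markDone(String url) {
        states.replace(url, LinkState.IN_PROGRESS, LinkState.PROCESSED);
    }

    // Called when crawling failed, so another thread can retry the URL.
    void release(String url) {
        states.replace(url, LinkState.IN_PROGRESS, LinkState.UNPROCESSED);
    }

    LinkState stateOf(String url) {
        return states.get(url);
    }
}
```

A worker would call tryClaim before crawling, then markDone on success or release on failure, so a URL that is being processed is never handed out again and a failed one is not lost.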

I did not read the logs because they are useless to me.

-edit for comment- Databases generally have transactions. Modifications you make in one transaction are not seen in other transactions until they are committed. Transactions can be rolled back. You'll need to look into fetching the row you just updated and see if the value has really changed. Do this in another transaction or on another connection.

The gap of 20 seconds looks like it only applies when each thread is started. Imagine a situation where Thread1 processes URL1 and Thread2 processes URL2. They both finish at about the same time. They both look for the next unprocessed URL (say URL3). They would both start processing this URL because neither knows the other thread has started it. You need one process handing out the URLs; a queue is possibly what you want to look at.
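A minimal sketch of the queue idea, assuming the unprocessed URLs can be loaded up front: the workers take URLs from a shared BlockingQueue instead of each querying the database for "the next unprocessed link". QueueCrawlDemo and crawlAll are hypothetical names, and the crawling itself is replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueCrawlDemo {
    // Hands each URL to exactly one of threadCount workers and returns
    // the URLs that were "crawled" (order depends on thread scheduling).
    static List<String> crawlAll(List<String> urls, int threadCount) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls);
        List<String> crawled = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread worker = new Thread(() -> {
                String url;
                // poll() removes the head atomically, so two threads
                // never receive the same URL.
                while ((url = queue.poll()) != null) {
                    crawled.add(url); // placeholder for the real crawl
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return crawled;
    }
}
```

With this shape, Thread1 and Thread2 finishing at the same time is harmless: each poll() returns a different URL, or null once the queue is empty.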

Logging might be improved if you knew which threads were working on which URLs. You also need a smaller sample size so that you can get your head around what is going on.

Upvotes: 2
