Multi threading multiple pdf files

Question

I'm trying to run multiple PDF files through a function that scrapes the text, compares it to a static dictionary, then adds its relational data to an index table in MySQL. I looked into multithreading but am not sure whether it would achieve what I need.

Here is the for loop where I go through all the PDF files:

for(String temp: files){
    //addToDict(temp,dictonary,conn);
    //new Scraper(temp,dictonary,conn).run();
    Scraper obj=new Scraper(temp,dictonary,conn);  
    Thread T1 =new Thread(obj);  
    T1.start();  

    //System.out.println((ammountOfFiles--)+" files left");
}

And here is the Scraper class I created, which implements Runnable:

public  class Scraper implements Runnable {

    private String filePath;
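    // word -> word_ID; typed so that map.get(y) can be passed to setInt
    private HashMap<String, Integer> map;
    private Connection conn;

    public Scraper(String file_path,HashMap<String, Integer> dict,Connection connection) {
       // store parameters for later use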
       filePath =file_path;
       map = dict;
       conn = connection;
   }

    @Override
    public void run() {
        //cut file path so it starts from the data folder 
        int cutPos = filePath.indexOf("Data");
        String cutPath = filePath.substring(cutPos);
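        // replace() takes a literal; replaceAll("\\", ...) would fail because a lone backslash is not a valid regex
        cutPath = cutPath.replace("\\", "|");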

        System.out.println(cutPath+" being scrapped");

        // Queries
        String addSentanceQuery ="INSERT INTO sentance(sentance_ID,sentance_Value) VALUES(Default,?)";
        String addContextQuery ="INSERT INTO context(context_ID,word_ID,sentance_ID,pdf_path) VALUES(Default,?,?,?)";

        // Prepared Statementes

        // RESULT SETS
        ResultSet sentanceKeyRS=null;

        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        FileInputStream inputstream = null;
        try {
            inputstream = new FileInputStream(new File(filePath));
        } catch (FileNotFoundException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        }
        ParseContext pcontext = new ParseContext();

        //parsing the document using PDF parser
        PDFParser pdfparser = new PDFParser();
        try {
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        } catch (IOException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (SAXException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (TikaException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        }

        //getting the content of the document
        String fileText = handler.toString();

        fileText = fileText.toLowerCase();

        //spilt text by new line

        String sentances [] = fileText.split("\n");

        for(String x : sentances){
            x = x.trim();
            if(x.isEmpty() || x.matches("\t+") || x.matches("\n+") || x.matches("")){

            }else{
                int sentanceID = 0;
                //add sentance to db and get the id
                try (PreparedStatement addSentancePrepare = conn.prepareStatement(addSentanceQuery,Statement.RETURN_GENERATED_KEYS)) {
                    addSentancePrepare.setString(1, x);
                    addSentancePrepare.executeUpdate();
                    sentanceKeyRS = addSentancePrepare.getGeneratedKeys();
                    while (sentanceKeyRS.next()) {
                        sentanceID = sentanceKeyRS.getInt(1);
                    }
                    addSentancePrepare.close();
                    sentanceKeyRS.close();
                } catch (SQLException ex) {
                    Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
                }

                String words [] = x.split(" ");

                for(String y : words){
                    y = y.trim();
                    if(y.matches("\\s+") || y.matches("")){

                    }else if(map.containsKey(y)){

                        //get ID and put in middle table
                        try (PreparedStatement addContextPrepare = conn.prepareStatement(addContextQuery)) {
                            addContextPrepare.setInt(1, map.get(y));
                            addContextPrepare.setInt(2, sentanceID);
                            addContextPrepare.setString(3, cutPath);
                            addContextPrepare.executeUpdate();


                            addContextPrepare.close();

                        } catch (SQLException ex) {
                            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
                        }

                    }              
                }            
            }    
        }


        try {
            inputstream.close();
        } catch (IOException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

}

Am I going about this correctly? I have never used multithreading, but it seems like it would speed up my program.

Sudheera · Accepted Answer

You have the basic structure of your program in place, and conceptually you got it almost right. A few concerns, though.

Scalability

You cannot simply keep adding threads as you get more files to process. Although it feels like more concurrent workers should always improve performance, in the real world that is often not the case. Once the number of threads passes a certain level (which depends on various parameters), performance actually decreases because of thread contention, communication overhead, and memory usage. So I propose you use a thread pool implementation from the java.util.concurrent package. Refer to the following modification I made to your code.

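import java.sql.Connection;
import java.util.HashMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class Test {

    private final ThreadPoolExecutor threadPoolExecutor;
    // Shared state handed to every Scraper; passed in here so the class compiles on its own
    private final HashMap<String, Integer> dictonary;
    private final Connection conn;

    public Test(int coreSize, int maxSize, HashMap<String, Integer> dictonary, Connection conn) {
        // Threads above coreSize are stopped after 50 ms of idling; up to 100 tasks can wait in the queue
        this.threadPoolExecutor = new ThreadPoolExecutor(coreSize, maxSize, 50, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(100));
        this.dictonary = dictonary;
        this.conn = conn;
    }

    public void submit(String[] files) {
        for(String temp: files){
            // One task per PDF; the pool decides when it actually runs
            Scraper obj=new Scraper(temp,dictonary,conn);
            threadPoolExecutor.submit(obj);
        }
    }

    public void shutDown() {
        // Stops accepting new tasks; tasks already submitted still run to completion
        this.threadPoolExecutor.shutdown();
    }
}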
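
Two things the class above leaves open are how the caller learns that every file has been processed, and what happens when many files arrive at once: with a bounded queue of 100, submit throws RejectedExecutionException once the queue is full and all maxSize threads are busy. Here is a minimal sketch of one way around both, assuming a fixed-size pool with an unbounded queue is acceptable for your workload. PoolDemo, runAll and the counting task are hypothetical names for illustration; the counting task stands in for Scraper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolDemo {

    // Runs one task per file on a fixed number of worker threads and
    // blocks until the tasks have finished (or the timeout expires).
    // Returns how many tasks completed.
    static int runAll(List<String> files, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger done = new AtomicInteger();

        for (String file : files) {
            // Hypothetical stand-in for: pool.submit(new Scraper(file, dictonary, conn));
            pool.submit(() -> {
                done.incrementAndGet();
            });
        }

        pool.shutdown(); // no new tasks accepted; queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow(); // give up after the timeout
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return done.get();
    }

    public static void main(String[] args) {
        int workers = Runtime.getRuntime().availableProcessors();
        int finished = runAll(Arrays.asList("a.pdf", "b.pdf", "c.pdf"), workers);
        System.out.println(finished + " files processed");
    }
}
```

Starting with roughly one worker per core and measuring from there is a reasonable first guess; since the tasks also wait on the database, the best number for your setup may differ.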

Thread safety and Synchronization

I can see you have shared one java.sql.Connection instance across the threads. Even though java.sql.Connection is thread safe, this usage will reduce your app's performance significantly, since java.sql.Connection achieves thread safety through synchronization, so only one thread can use the connection at a time. To overcome this, use connection pooling. One simple implementation I could suggest is Apache Commons DBCP.
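
The borrow-and-return idea behind a connection pool can be sketched with the standard library alone. This is only an illustration of the concept, not how DBCP is implemented: SimplePool and its methods are hypothetical names, and a String stands in for java.sql.Connection so the example runs without a database.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical fixed-size pool: each worker borrows its own resource
// instead of all workers contending for a single shared one.
class SimplePool<T> {

    private final BlockingQueue<T> idle;

    SimplePool(List<T> resources) {
        // Capacity equals the number of resources, so giveBack never overflows
        idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
    }

    // Blocks until a resource is free.
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a resource", e);
        }
    }

    // Makes the resource available to other workers again.
    void giveBack(T resource) {
        idle.offer(resource);
    }

    int available() {
        return idle.size();
    }
}

class SimplePoolDemo {
    public static void main(String[] args) {
        SimplePool<String> pool = new SimplePool<>(Arrays.asList("conn-1", "conn-2"));
        String c = pool.borrow();
        try {
            System.out.println("working with " + c);
        } finally {
            pool.giveBack(c); // always return the resource, even on failure
        }
    }
}
```

With a real pool such as DBCP's BasicDataSource, each Scraper would instead call getConnection() on the shared data source at the start of run() and close() the connection when done, which hands it back to the pool rather than closing the underlying database connection.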
