Reputation: 62
So I'm trying to run multiple PDF files through a function that scrapes the text, compares it to a static dictionary, then adds its relational data to an index table in MySQL. I looked into multithreading but am not sure if this would achieve what I need.
Here is the for loop where I iterate over all the PDF files:
for (String temp : files) {
    //addToDict(temp, dictonary, conn);
    //new Scraper(temp, dictonary, conn).run();
    Scraper obj = new Scraper(temp, dictonary, conn);
    Thread t1 = new Thread(obj);
    t1.start();
    //System.out.println((ammountOfFiles--) + " files left");
}
And here is the Scraper class I created that implements Runnable:
public class Scraper implements Runnable {
    private String filePath;
    private HashMap<String, Integer> map;
    private Connection conn;

    public Scraper(String file_path, HashMap<String, Integer> dict, Connection connection) {
        // store parameters for later use
        filePath = file_path;
        map = dict;
        conn = connection;
    }

    @Override
    public void run() {
        // cut the file path so it starts from the data folder
        int cutPos = filePath.indexOf("Data");
        String cutPath = filePath.substring(cutPos);
        cutPath = cutPath.replaceAll("\\\\", "|");
        System.out.println(cutPath + " being scraped");
        // Queries
        String addSentanceQuery = "INSERT INTO sentance(sentance_ID,sentance_Value) VALUES(Default,?)";
        String addContextQuery = "INSERT INTO context(context_ID,word_ID,sentance_ID,pdf_path) VALUES(Default,?,?,?)";
        // Result sets
        ResultSet sentanceKeyRS = null;
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        FileInputStream inputstream = null;
        try {
            inputstream = new FileInputStream(new File(filePath));
        } catch (FileNotFoundException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        }
        ParseContext pcontext = new ParseContext();
        // parse the document using the PDF parser
        PDFParser pdfparser = new PDFParser();
        try {
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        } catch (IOException | SAXException | TikaException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        }
        // get the content of the document
        String fileText = handler.toString();
        fileText = fileText.toLowerCase();
        // split the text by new line
        String sentances[] = fileText.split("\\n");
        for (String x : sentances) {
            x = x.trim();
            if (x.isEmpty() || x.matches("\\t+")) {
                continue;
            }
            int sentanceID = 0;
            // add the sentence to the db and get its generated id
            try (PreparedStatement addSentancePrepare = conn.prepareStatement(addSentanceQuery, Statement.RETURN_GENERATED_KEYS)) {
                addSentancePrepare.setString(1, x);
                addSentancePrepare.executeUpdate();
                sentanceKeyRS = addSentancePrepare.getGeneratedKeys();
                while (sentanceKeyRS.next()) {
                    sentanceID = sentanceKeyRS.getInt(1);
                }
                sentanceKeyRS.close();
            } catch (SQLException ex) {
                Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
            }
            String words[] = x.split(" ");
            for (String y : words) {
                y = y.trim();
                if (y.isEmpty() || y.matches("\\s+")) {
                    continue;
                }
                if (map.containsKey(y)) {
                    // get the word id and put it in the join table
                    try (PreparedStatement addContextPrepare = conn.prepareStatement(addContextQuery)) {
                        addContextPrepare.setInt(1, map.get(y));
                        addContextPrepare.setInt(2, sentanceID);
                        addContextPrepare.setString(3, cutPath);
                        addContextPrepare.executeUpdate();
                    } catch (SQLException ex) {
                        Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
                    }
                }
            }
        }
        try {
            inputstream.close();
        } catch (IOException ex) {
            Logger.getLogger(Scraper.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
Am I going about this correctly? I have never used multithreading, but it seems like it would speed up my program.
Upvotes: 0
Views: 1334
Reputation: 1957
You completed the basic modeling of your program; conceptually, you got it almost right. A few concerns, though.

First, you cannot simply keep increasing the number of threads as you get more files to process. Even though adding concurrent workers feels like it should increase performance, in the real world it might not: once the number of threads passes a certain level (which depends on various parameters), performance actually decreases due to thread contention, communication overhead, and memory usage. So I'm proposing you use a ThreadPool implementation that comes with the java.util.concurrent package. Refer to the following modification I made to your code.
public class Test {
    private final ThreadPoolExecutor threadPoolExecutor;

    public Test(int coreSize, int maxSize) {
        this.threadPoolExecutor = new ThreadPoolExecutor(coreSize, maxSize, 50, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(100));
    }

    public void submit(String[] files) {
        for (String temp : files) {
            //addToDict(temp, dictonary, conn);
            //new Scraper(temp, dictonary, conn).run();
            Scraper obj = new Scraper(temp, dictonary, conn);
            threadPoolExecutor.submit(obj);
            //System.out.println((ammountOfFiles--) + " files left");
        }
    }

    public void shutDown() {
        this.threadPoolExecutor.shutdown();
    }
}
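To see the executor lifecycle end to end (submit the tasks, call shutdown, then wait for the queue to drain), here is a minimal self-contained sketch. The file names and the counting task are placeholders standing in for your Scraper and its database work:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolDemo {

    // Runs one placeholder task per file on a fixed-size pool and
    // returns how many tasks actually completed.
    public static int processAll(List<String> files, int poolSize) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (String f : files) {
            // placeholder for: pool.submit(new Scraper(f, dictonary, conn));
            pool.submit(processed::incrementAndGet);
        }
        pool.shutdown();                            // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for queued tasks to finish
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int done = processAll(List.of("a.pdf", "b.pdf", "c.pdf"), 2);
        System.out.println(done + " files processed");
    }
}
```

`Executors.newFixedThreadPool` is just a convenience wrapper around the same `ThreadPoolExecutor` used above; the important part is calling `shutdown()` and `awaitTermination()` so the program does not exit while scrapers are still running.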
Second, you are sharing a single java.sql.Connection instance across the threads. Even though java.sql.Connection is thread safe, this usage will drop your app's performance significantly, because java.sql.Connection achieves thread safety through synchronization: only one thread can use the connection at a time. To overcome this you can use the Connection Pooling concept. One simple implementation I could suggest is Apache Commons DBCP.

Upvotes: 1
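Under the hood, a connection pool hands out a fixed set of connections and blocks a borrower until one is free, so each thread gets exclusive use of its own connection. DBCP does this properly for java.sql.Connection; purely to illustrate the borrow/release idea, here is a tiny generic pool where plain strings stand in for connections:

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TinyPool<T> {
    private final BlockingQueue<T> idle;

    // Pre-create all resources up front; the queue capacity equals the pool size.
    public TinyPool(Collection<T> resources) {
        this.idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
    }

    // Blocks until a resource is free, so at most poolSize threads hold one at a time.
    public T borrow() throws InterruptedException {
        return idle.take();
    }

    // Hand the resource back so another waiting thread can use it.
    public void release(T resource) {
        idle.offer(resource);
    }

    public static void main(String[] args) throws InterruptedException {
        TinyPool<String> pool = new TinyPool<>(List.of("conn-1", "conn-2"));
        String c = pool.borrow();
        System.out.println("borrowed " + c);
        pool.release(c);
    }
}
```

In the real app each Scraper would borrow a connection at the start of run(), use it for its inserts, and release it in a finally block; a library pool such as DBCP adds validation, timeouts, and lazy creation on top of this same pattern.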