Reputation: 1802
I am building an index table (inverted file) on a table in MySQL. The program extracts all the words from a file, stores them in a HashSet, and then inserts the words one by one into my database table.
It works correctly, and I am aware that building an inverted index takes some time. I am trying to optimize the indexing time, and I am considering using multi-threading. Will it speed up the performance?
However, I am not sure how to integrate it into my current program, as I am new to multi-threading.
Code:
public static void main(String[] args) throws Exception {
    StopWatch stopwatch = new StopWatch();
    stopwatch.start();
    File folder = new File("D:\\PDF1");
    File[] listOfFiles = folder.listFiles();
    for (File file : listOfFiles) {
        if (file.isFile()) {
            HashSet<String> uniqueWords = new HashSet<>();
            String path = "D:\\PDF1\\" + file.getName();
            try (PDDocument document = PDDocument.load(new File(path))) {
                if (!document.isEncrypted()) {
                    PDFTextStripper tStripper = new PDFTextStripper();
                    String pdfFileInText = tStripper.getText(document);
                    String[] lines = pdfFileInText.split("\\r?\\n");
                    for (String line : lines) {
                        String[] words = line.split(" ");
                        for (String word : words) {
                            uniqueWords.add(word);
                        }
                    }
                    // System.out.println(uniqueWords);
                }
            } catch (IOException e) {
                System.err.println("Exception while trying to read pdf document - " + e);
            }
            Object[] words = uniqueWords.toArray();
            MysqlAccessIndex connection = new MysqlAccessIndex();
            for (int i = 0; i < words.length; i++) { // start at 0 so the first word is not skipped
                connection.readDataBase(path, words[i].toString());
            }
            System.out.println("Completed");
        }
    }
}
MySQL connection:
public class MysqlAccessIndex {
    public Connection connect = null;
    public Statement statement = null;
    public PreparedStatement preparedStatement = null;
    public ResultSet resultSet = null;

    public void connect() throws Exception {
        // This will load the MySQL driver; each DB has its own driver
        Class.forName("com.mysql.jdbc.Driver");
        // Set up the connection with the DB
        connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                        + "user=root&password=root");
        // Statements allow us to issue SQL queries to the database
        statement = connect.createStatement();
        System.out.print("Connected");
    }

    public MysqlAccessIndex() throws Exception {
        connect();
    }

    public void readDataBase(String path, String word) throws Exception {
        try {
            // This opens a brand-new connection and statement on every call
            Class.forName("com.mysql.jdbc.Driver");
            connect = DriverManager
                    .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                            + "user=root&password=root");
            statement = connect.createStatement();
            System.out.print("Connected");
            preparedStatement = connect
                    .prepareStatement("insert IGNORE into fulltext_ltat.indextable values (default, ?, ?)");
            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
            // resultSet = statement
            //         .executeQuery("select * from fulltext_ltat.index_detail");
            // writeResultSet(resultSet);
        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }
    }

    // close() helper (not shown in the post) releases the JDBC resources
}
I would appreciate any pointers.
Upvotes: 0
Views: 1601
Reputation: 103893
No, pushing data into a database with multiple threads usually does not speed anything up.
Instead, try the following:
[1] When bulk-adding data, use the bulk-load primitive your DB engine offers; it is much faster than row-by-row INSERTs. In PostgreSQL that is COPY; MySQL's equivalent is LOAD DATA [LOCAL] INFILE.
[2] Especially if you can't use a bulk-load primitive, drop all the indexes first, do all your inserts, then re-create the indexes. This is faster than maintaining the indexes on every insert.
[3] Use transactions, and commit every ~100 inserts or so. This is faster than committing after every insert, and in most cases also faster than a single commit after hundreds of thousands of inserts.
[4] Start earlier. In your example code, you can start inserting right away instead of first stuffing all the words into a HashSet and inserting them afterwards.
[5] Don't keep creating prepared statements; create one and reuse it.
[6] You create a Statement twice (once in connect() and again in readDataBase()) and never use it. Don't; you're wasting resources.
[7] PreparedStatements need to be closed, and you aren't closing them. That is probably going to slow things down considerably. Don't make so many (just one should do), and close it when you're done with it. Look up try-with-resources (originally announced as 'ARM', automatic resource management), the Java construct that makes it easy to close resources properly; it has been in the language since Java 7.
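Points [3], [5] and [7] can be combined into one sketch: a single PreparedStatement (using the table from the question), reused for every row, wrapped in try-with-resources, with a commit flushed every 100 rows. This is illustrative, not tested against your schema; main only exercises the batching arithmetic so it runs without a database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Collection;

class BatchIndexer {
    static final int BATCH_SIZE = 100;

    // Inserts all words for one document path inside explicit transactions,
    // sending a batch to the server and committing every BATCH_SIZE rows.
    static void insertAll(Connection conn, String path, Collection<String> words) throws Exception {
        conn.setAutoCommit(false);                           // point [3]: explicit transactions
        try (PreparedStatement ps = conn.prepareStatement(   // point [7]: auto-closed
                "INSERT IGNORE INTO fulltext_ltat.indextable VALUES (DEFAULT, ?, ?)")) {
            int pending = 0;
            for (String word : words) {                      // point [5]: one statement, reused
                ps.setString(1, path);
                ps.setString(2, word);
                ps.addBatch();
                if (++pending % BATCH_SIZE == 0) {
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch();                               // flush the final partial batch
            conn.commit();
        }
    }

    // Number of non-empty batches the loop above sends for a given word count.
    static int batchCount(int words, int batchSize) {
        return (words + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // No database needed to sanity-check the batching arithmetic:
        System.out.println(batchCount(250, BATCH_SIZE)); // 250 words -> 3 batches
        System.out.println(batchCount(100, BATCH_SIZE)); // exactly one full batch
    }
}
```

With Connector/J you can additionally set rewriteBatchedStatements=true on the JDBC URL so each batch becomes a single multi-row INSERT on the wire.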
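Points [1] and [2] could look like the following: stage the rows in a tab-separated file and let MySQL bulk-load it with the index dropped for the duration of the load. The column names (path, word) and the index name word_idx are assumptions, not taken from your schema, and LOAD DATA LOCAL requires allowLoadLocalInfile=true on the Connector/J URL; main only runs the database-free staging step.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.Statement;
import java.util.List;

class BulkLoad {
    // Writes one "path<TAB>word" row per word and returns the staging file.
    static Path stageRows(String docPath, List<String> words) throws Exception {
        Path tsv = Files.createTempFile("indexrows", ".tsv");
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            sb.append(docPath).append('\t').append(w).append('\n');
        }
        Files.writeString(tsv, sb.toString());
        return tsv;
    }

    // Point [2]: drop the secondary index, bulk-load (point [1]), re-create it.
    // "word_idx" and the column list are hypothetical names.
    static void bulkLoad(Connection conn, Path tsv) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE fulltext_ltat.indextable DROP INDEX word_idx");
            st.execute("LOAD DATA LOCAL INFILE '" + tsv.toAbsolutePath()
                    + "' IGNORE INTO TABLE fulltext_ltat.indextable (path, word)");
            st.execute("ALTER TABLE fulltext_ltat.indextable ADD INDEX word_idx (word)");
        }
    }

    public static void main(String[] args) throws Exception {
        // The staging step needs no database:
        Path tsv = stageRows("D:\\PDF1\\a.pdf", List.of("inverted", "index"));
        System.out.println(Files.readAllLines(tsv).size()); // 2 rows staged
        Files.delete(tsv);
    }
}
```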
Upvotes: 3