Reputation: 95
I'm trying to create a database (using SQLite in C#) with a table to store sentences, a table to store each individual word that is used in those sentences, and a junction table that relates the words to the sentences that they appear in. I am then trying to populate the database with over 15 million sentences. I estimate that there are about 150 million inserts happening in my code. Right now, my code is only doing a couple hundred sentences per second, which will take forever to get through this huge data set. How can I make this faster?
I tried putting the ENTIRE thing in a single transaction, but I'm not sure if that can work due to the huge amount of data. So I am using one transaction for each sentence.
Tables:
CREATE TABLE sentences ( sid INTEGER NOT NULL PRIMARY KEY, sentence TEXT );
CREATE TABLE words ( wid INTEGER NOT NULL PRIMARY KEY, dictform TEXT UNIQUE);
CREATE TABLE sentence_words (
    sid INTEGER NOT NULL,
    wid INTEGER NOT NULL,
    CONSTRAINT PK_sentence_words PRIMARY KEY (sid, wid),
    FOREIGN KEY (sid) REFERENCES sentences(sid),
    FOREIGN KEY (wid) REFERENCES words(wid)
);
Code:
while ((input = sr.ReadLine()) != null) // read in a new sentence
{
    tr = m_dbConnection.BeginTransaction();

    sql = "INSERT INTO sentences (sentence) VALUES(@sentence)";
    cmd = new SQLiteCommand(sql, m_dbConnection);
    cmd.Parameters.AddWithValue("@sentence", input);
    cmd.ExecuteNonQuery();

    dict_words = jutils.onlyDict(input); // convert all words to their 'dictionary form'
    var words = dict_words.Split(' ');
    foreach (var wd in words) // for each word
    {
        sql = "INSERT OR IGNORE INTO words (dictform) VALUES(@dictform)";
        cmd = new SQLiteCommand(sql, m_dbConnection);
        cmd.Parameters.AddWithValue("@dictform", wd);
        cmd.ExecuteNonQuery();

        sql = "INSERT OR IGNORE INTO sentence_words (sid, wid) VALUES((SELECT sid FROM sentences WHERE sentence = @sentence), (SELECT wid FROM words WHERE dictform = @dictform))";
        cmd = new SQLiteCommand(sql, m_dbConnection);
        cmd.Parameters.AddWithValue("@sentence", input);
        cmd.Parameters.AddWithValue("@dictform", wd);
        cmd.ExecuteNonQuery();
    }

    tr.Commit();
}
Upvotes: 1
Views: 247
Reputation: 180070
Just put the entire thing in a single transaction. (The rollback journal stores the old version of any page changed by the transaction, so for an empty database, it cannot become big.)
Furthermore, all the lookups of the sentence value are inefficient, because the database has to scan the entire table for each one. If that column had an index (either explicitly, or implicitly through a UNIQUE constraint), those lookups would be much faster.
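Applied to the loop in the question, that could look roughly like the following. This is only a sketch: it reuses the question's m_dbConnection and sr, elides the unchanged INSERT statements, and the index name idx_sentences_sentence is made up.
// Index the column that the sid subquery searches on, so each lookup is an
// index probe instead of a full scan of the sentences table.
// (A UNIQUE constraint on the column would create the same kind of index implicitly.)
using (var cmd = new SQLiteCommand(
    "CREATE INDEX IF NOT EXISTS idx_sentences_sentence ON sentences(sentence)",
    m_dbConnection))
{
    cmd.ExecuteNonQuery();
}

// One transaction around the entire import instead of one per sentence.
using (var tr = m_dbConnection.BeginTransaction())
{
    string input;
    while ((input = sr.ReadLine()) != null)
    {
        // ... the per-sentence and per-word INSERTs from the question, unchanged ...
    }
    tr.Commit(); // a single commit at the very end
}
Creating the index before the load is deliberate here: the sentence lookups happen during the load, so the index has to exist while the rows are being inserted, not afterwards.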
Upvotes: 0
Reputation:
You can group the inserts into batches of a fixed number of lines
int lines = -1;
int group = 500;
while ((input = sr.ReadLine()) != null)
{
    lines++;
    if (lines % group == 0) {
        tr = m_dbConnection.BeginTransaction(); // start a new batch
    }
    // ... the per-sentence INSERTs from the question go here ...
and commit each group
    if (lines % group == group - 1) {
        tr.Commit();
    }
}
// Commit the final partial batch (skipped if the file was empty or the
// last batch was already committed inside the loop).
if (lines % group >= 0 && lines % group < group - 1) {
    tr.Commit();
}
I have had old projects where I was forced to use .NET 4.0 and BulkCopy was not an option due to its async implementation.
Upvotes: 0
Reputation: 952
With data this large, you should always avoid doing the SQL work 'one by one'.
In my case (when memory is not a concern), I load the data into a DataTable, manipulate it as needed (with LINQ), and finally write it all out with SqlBulkCopy.
There is also SqlBulkUpdate, but that is a third-party library and it supports SQL Server 2008 and later. On versions before 2008 you can still do this quickly, but you have to create a temporary SQL table and use an UPDATE with a JOIN.
SqlBulkCopy is really fast; it takes just seconds.
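A minimal sketch of that workflow, assuming a SQL Server target (SqlBulkCopy is part of System.Data.SqlClient, not the SQLite provider); the input file, connection string, and destination table name are placeholders.
using System.Data;
using System.Data.SqlClient;
using System.IO;

// Build everything in memory first ...
var table = new DataTable("sentences");
table.Columns.Add("sentence", typeof(string));
foreach (var line in File.ReadLines("sentences.txt")) // placeholder input file
{
    table.Rows.Add(line);
}

// ... then push it to the server in one bulk operation instead of row-by-row INSERTs.
using (var conn = new SqlConnection("<connection string>"))
{
    conn.Open();
    using (var bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "sentences";
        bulk.WriteToServer(table);
    }
}
Whether it really finishes in seconds depends on the data, but the key point is that the rows travel to the server as one bulk stream rather than as millions of individual statements.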
Upvotes: 2