user8622748

C#: Slow inserting with Cassandra Database

I am trying to insert millions of records, possibly even billions, into a Cassandra database. Is there any faster way to do this? It's taking me 30 minutes per file and I have over 100 files. I need to go through every file in a certain directory, read all of its lines, and insert every line of every file into the Cassandra database. The files range in size from 1 KB up to 300,000 KB at the maximum.

The file I am referring to is 9,734 KB and has been processing for 30 minutes without completing. Surely there has to be a faster way of inserting records? The file being processed has 942,345 lines.

At this rate it'll take a few days to insert all of these records.

I tried it with and without batching; both run at roughly the same speed.

Console.CursorVisible = false;

var cluster = Cluster.Builder().AddContactPoints("127.0.0.1").Build();
var session = cluster.Connect("cracking");

Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine();
Console.WriteLine("  [" + DateTime.Now.ToShortTimeString() + "]" + " Connected to the Cassandra Database");
Console.WriteLine();
Console.ForegroundColor = ConsoleColor.White;

string filepath = @"C:\Users\admin\Desktop\wecrack lists\test";
DirectoryInfo directory = new DirectoryInfo(filepath);

int fileCount = 0;

foreach (var file in directory.GetFiles("*"))
{
    fileCount++;

    Console.WriteLine("  [" + DateTime.Now.ToShortTimeString() + "]" + " Working through file: {" + file + "} {" + fileCount + "/" + directory.GetFiles("*").Count() + "}");

    var lines = File.ReadLines(filepath + @"\" + file.ToString()).ToList();

    var batch = new BatchStatement();

    int lineCount = 0;

    while (lines.Count > 0)
    {
        foreach (string line in lines.ToList())
        {
            if (lineCount >= 2000)
            {
                lineCount = 0;

                Console.WriteLine("  [" + DateTime.Now.ToShortTimeString() + "]" + " Changing batch for file: {" + file + "} {" + fileCount + "/" + directory.GetFiles("*").Count() + "}");
                session.Execute(batch);
                batch = new BatchStatement();
                break;
            }

            lineCount++;
            lines.Remove(line);

            var userTrackStmt = session.Prepare("INSERT INTO passwords (id, password) VALUES (?, ?)");
            batch.Add(userTrackStmt.Bind(Guid.NewGuid(), line));
        }
    }
}

Console.WriteLine();
Console.WriteLine("  [" + DateTime.Now.ToShortTimeString() + "]" + " Finished inserting records, press any key to get the count.");
Console.ReadKey(true);

Console.WriteLine();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("  " + string.Format("{0:n0}", session.Execute("SELECT * FROM passwords").Count()) + " records.");

while (true)
{
    Console.ReadKey(true);
}

Upvotes: 1

Views: 277

Answers (1)

Slugart

Reputation: 4680

You do not need to prepare the statement each time you use it. You should prepare it once and bind it for each insert operation.

Also, you should separate out concerns as suggested by user23477763. That will let you isolate the cost of creating many lists and deleting from the start of those lists.

You don't need to batch, as batching gives you a transactional guarantee that you don't need here. It's hard to know the exact impact without knowing what your schema looks like. Have a look at https://docs.datastax.com/en/cql/3.3/cql/cql_using/useBatchBadExample.html

Also keep in mind that you can send multiple insert operations to Cassandra in parallel.
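Putting those points together, here is a minimal sketch of what that could look like. It assumes the same `passwords` table and the DataStax C# driver from the question; the in-flight limit of 256 is an illustrative throttle I chose, not a driver default, so tune it for your cluster. It can't be run without a live Cassandra node.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Cassandra;

class BulkInserter
{
    static async Task Main()
    {
        var cluster = Cluster.Builder().AddContactPoints("127.0.0.1").Build();
        var session = cluster.Connect("cracking");

        // Prepare ONCE, outside every loop, then bind per row.
        var insert = session.Prepare(
            "INSERT INTO passwords (id, password) VALUES (?, ?)");

        const int maxInFlight = 256; // illustrative cap on concurrent requests
        var pending = new List<Task>();

        foreach (var file in Directory.EnumerateFiles(
                     @"C:\Users\admin\Desktop\wecrack lists\test"))
        {
            // File.ReadLines streams the file; no ToList(), no Remove().
            foreach (var line in File.ReadLines(file))
            {
                // No batch: each row is its own async insert.
                pending.Add(session.ExecuteAsync(
                    insert.Bind(Guid.NewGuid(), line)));

                if (pending.Count >= maxInFlight)
                {
                    await Task.WhenAll(pending);
                    pending.Clear();
                }
            }
        }

        await Task.WhenAll(pending); // drain the remainder
        Console.WriteLine("Done.");
    }
}
```

This also avoids the `lines.ToList()` / `lines.Remove(line)` pattern from the question, which rescans and reallocates the list on every pass.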

Upvotes: 2
