Chris Ballance

Reputation: 34347

String chunking algorithm with natural language context

I have an arbitrarily large string of text from the user that needs to be split into chunks of about 10k characters (the size may need to be adjustable) and sent off to another system for processing.

I'm trying not to reinvent the wheel here. Are there any suggestions before I roll this from scratch?

Using C#.

Upvotes: 0

Views: 590

Answers (2)

Scott J

Reputation: 1331

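This may not handle every case you need, but it should get you on your way. It ends each chunk at the last punctuation character within the size limit, falls back to the last space, and otherwise cuts at the limit.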

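    using System;
    using System.Collections.Generic;

    // Splits bigString into chunks of at most maxSize characters, preferring to
    // break after punctuation, then after a space, then at the hard limit.
    public IList<string> ChunkifyText(string bigString, int maxSize, char[] punctuation)
    {
        // A non-positive size would never advance startIndex and loop forever.
        if (maxSize <= 0)
            throw new ArgumentOutOfRangeException("maxSize");

        List<string> results = new List<string>();
        int startIndex = 0;

        while (startIndex < bigString.Length)
        {
            // The remaining text fits in one chunk: take it whole and stop,
            // rather than splitting it again at its last punctuation mark.
            if (bigString.Length - startIndex <= maxSize)
            {
                results.Add(bigString.Substring(startIndex));
                break;
            }

            string chunk = bigString.Substring(startIndex, maxSize);

            // Prefer to end the chunk on the last punctuation character.
            int endIndex = chunk.LastIndexOfAny(punctuation);

            // Otherwise fall back to the last space.
            if (endIndex < 0)
                endIndex = chunk.LastIndexOf(' ');

            // No natural break at all: cut at the hard limit.
            if (endIndex < 0)
                endIndex = chunk.Length - 1;

            results.Add(chunk.Substring(0, endIndex + 1));
            startIndex += endIndex + 1;
        }

        return results;
    }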
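
If the text really is arbitrarily large, it may help to produce the chunks lazily so each one can be sent off before the next is computed. The following is only a sketch of that variation, not the method above: `TextChunker`, `SplitOnSentences`, and the hard-coded set of sentence-ending characters are hypothetical names and assumptions made for illustration. It searches the original string directly with the index-based overloads of `LastIndexOfAny` and `LastIndexOf`, so no intermediate window string is allocated.

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker
{
    // Hypothetical helper: lazily yields chunks of at most maxSize characters,
    // breaking after a sentence end when one exists inside the window.
    public static IEnumerable<string> SplitOnSentences(string text, int maxSize)
    {
        if (text == null) throw new ArgumentNullException("text");
        if (maxSize <= 0) throw new ArgumentOutOfRangeException("maxSize");

        char[] sentenceEnds = { '.', '!', '?' };   // assumed set; adjust as needed
        int start = 0;

        while (text.Length - start > maxSize)
        {
            // Search backward from the end of the window [start, start + maxSize).
            int last = start + maxSize - 1;
            int cut = text.LastIndexOfAny(sentenceEnds, last, maxSize);

            // Fall back to the last space, then to a hard cut at the limit.
            if (cut < 0) cut = text.LastIndexOf(' ', last, maxSize);
            if (cut < 0) cut = last;

            yield return text.Substring(start, cut - start + 1);
            start = cut + 1;
        }

        // Whatever is left fits in a single chunk.
        if (start < text.Length) yield return text.Substring(start);
    }
}
```

Since this is an iterator, the argument checks run when enumeration starts, and a caller can simply write `foreach (string chunk in TextChunker.SplitOnSentences(input, 10000))` and forward each chunk as it arrives. Concatenating the chunks in order gives back the original text, because break characters stay attached to the chunk they end.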

Upvotes: 2

MStodd

Reputation: 4746

This will probably end up being more difficult than you're expecting (most natural language things are), but check out Sharp Natural Language Parser.

I'm currently using SharpNLP. It works pretty well, but there are always gotchas.

Let me know if this isn't what you're looking for.

Mark

Upvotes: 1
