Samantha J T Star

Reputation: 32758

How can I split text into lines based on a regular expression?

I have snippets of text that I would like to divide into lines. The problem is that the text has been formatted (entries are wrapped over several lines), so I cannot split it the way I normally would, which is this:

 _text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .ToArray();

Here is the sample text:

 adj 1: around the middle of a scale of evaluation of physical
        measures; "an orange of average size"; "intermediate
        capacity"; "a plane with intermediate range"; "medium
        bombers" [syn: {average}, {intermediate}]
 2: (of meat) cooked until there is just a little pink meat
    inside
 n 1: a means or instrumentality for storing or communicating
      information
 2: the surrounding environment; "fish require an aqueous
    medium"
 3: an intervening substance through which signals can travel as
    a means for communication
 4: (bacteriology) a nutrient substance (solid or liquid) that
    is used to cultivate micro-organisms [syn: {culture medium}]
 5: an intervening substance through which something is
    achieved; "the dissolving medium is called a solvent"
 6: a liquid with which pigment is mixed by a painter
 7: (biology) a substance in which specimens are preserved or
    displayed
 8: a state that is intermediate between extremes; a middle
    position; "a happy medium"

The format is always the same.

So in this case the line break would have to come before something like an optional 1-3 character word, followed by a 1-2 digit number, followed by a colon (:).
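As a rough illustration of that rule, a line that starts a new entry could be recognised with a pattern along these lines. This is only a sketch: the class and method names (`EntryHeader`, `IsEntryStart`) are made up for the example, and it assumes the optional word consists of letters only.

```csharp
using System;
using System.Text.RegularExpressions;

static class EntryHeader
{
    // Hypothetical helper: optional 1-3 letter word, then a 1-2 digit number, then a colon.
    // Assumption: the word is made of ASCII letters only (e.g. "adj", "n").
    static readonly Regex Header = new Regex(@"^\s*(?:[A-Za-z]{1,3}\s+)?\d{1,2}:");

    // Returns true when the line looks like the start of a new entry,
    // false for wrapped continuation lines.
    public static bool IsEntryStart(string line)
    {
        return Header.IsMatch(line);
    }
}
```

With this, `EntryHeader.IsEntryStart(" adj 1: around the middle")` is true, while a wrapped line such as `"        measures; ..."` is not treated as a new entry.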

Can someone give me some advice on how I could do this with the split or with another method?

Update: I tried Steven's answer, but I am not quite sure how to fit it into my function. Below is my function with the original code commented out, followed by Steven's suggested pattern; there is a part missing that I am not sure about:

    public parser(string text)
    {
        //_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // .ToArray();

        string pattern = @"(\w{1,3} )?1?\d: (?<line>[^\r\n]+)(\r?\n\s+(?<line>[^\r\n]+))*";
        foreach (Match m in Regex.Matches(text, pattern))
        {
            if (m.Success)
            {
                string entry = string.Join(Environment.NewLine,
                    m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value));
                // ...
            }
        }
    }

For testing purposes here is the text in a different format:

"medium\n adj 1: around the middle of a scale of evaluation of physical\n measures; \"an orange of average size\"; \"intermediate\n capacity\"; \"a plane with intermediate range\"; \"medium\n bombers\" [syn: {average}, {intermediate}]\n 2: (of meat) cooked until there is just a little pink meat\n inside\n n 1: a means or instrumentality for storing or communicating\n information\n 2: the surrounding environment; \"fish require an aqueous\n medium\"\n 3: an intervening substance through which signals can travel as\n a means for communication\n 4: (bacteriology) a nutrient substance (solid or liquid) that\n is used to cultivate micro-organisms [syn: {culture medium}]\n 5: an intervening substance through which something is\n achieved; \"the dissolving medium is called a solvent\"\n 6: a liquid with which pigment is mixed by a painter\n 7: (biology) a substance in which specimens are preserved or\n displayed\n 8: a state that is intermediate between extremes; a middle\n position; \"a happy medium\"\n 9: someone who serves as an intermediary between the living and\n the dead; \"he consulted several mediums\" [syn: {spiritualist}]\n 10: transmissions that are disseminated widely to the public\n [syn: {mass medium}]\n 11: an occupation for which you are especially well suited; \"in\n law he found his true metier\" [syn: {metier}]\n [also: {media} (pl)]\n"

Upvotes: 4

Views: 315

Answers (2)

Steven Doggart

Reputation: 43743

Regex works nicely for this. For instance:

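public parser(string text)
{
    // The first "line" capture is the entry header line: a leading space, an optional
    // 1-3 character word, a number from 0 to 19, a colon and the text. Every following
    // line that does not itself look like a header is captured into the same "line" group.
    string pattern = @"(?<line> (\w{1,3} )?1?\d: [^\r\n]+)(\r?\n(?! (\w{1,3} )?1?\d: [^\r\n]+)\s+(?<line>[^\r\n]+))*";
    var entries = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
    {
        // Join the header line and its continuation lines into a single entry.
        entries.Add(string.Join(" ",
            m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value)));
    }
    _text = entries.ToArray();
}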
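For comparison, since the question asks about doing this with a split: the same boundary can also be expressed with `Regex.Split` and a lookahead. This is an alternative sketch, not part of the answer above. The names `EntrySplitter` and `SplitEntries` are hypothetical, and it assumes each entry header is preceded by exactly one space, as in the test string from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EntrySplitter
{
    // Hypothetical helper: splits the text into one string per entry.
    public static string[] SplitEntries(string text)
    {
        // Split at each newline that is followed by an entry header. The lookahead
        // is zero-width, so the header stays in the piece that follows. The inner
        // group is non-capturing because Regex.Split also returns captured text.
        string[] parts = Regex.Split(text, @"\r?\n(?= (?:\w{1,3} )?\d{1,2}: )");

        return parts
            .Select(p => Regex.Replace(p, @"\s+", " ").Trim()) // merge wrapped lines
            .Where(p => p.Length > 0)                          // drop empty pieces
            .ToArray();
    }
}
```

If the input starts with a headword line (such as `medium` in the test string), that headword comes back as the first element; `.Skip(1)` before `.ToArray()` would drop it if it is not wanted.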

Upvotes: 2

jdweng

Reputation: 34421

Try this

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ConsoleApplication106
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
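            string inputLine = "";
            List<Data> data = new List<Data>();
            // Anchored at the start of the line; the prefix is letters only,
            // so "10:" is not read as prefix "1" with index 0.
            string pattern = @"^(?'prefix'[A-Za-z]{1,3})?\s*(?'index'\d{1,2}):(?'text'.*)";
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    Match match = Regex.Match(inputLine, pattern);
                    if (!match.Success)
                    {
                        // Continuation line: append it to the previous entry, if there is one.
                        if (data.Count > 0 && inputLine.Length > 0)
                            data[data.Count - 1].text += " " + inputLine;
                        continue;
                    }
                    Data newData = new Data();
                    data.Add(newData);
                    newData.prefix = match.Groups["prefix"].Value;
                    newData.index = int.Parse(match.Groups["index"].Value);
                    newData.text = match.Groups["text"].Value;
                }
            }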
        }
    }
    public class Data
    {
        public string prefix { get; set; }
        public int index { get; set; }
        public string text { get; set; }
    }
}

Upvotes: 2
