Reputation: 32758
I have snippets of text that I would like to divide into lines. The problem is that the text has already been formatted, with each definition wrapped across several physical lines, so I cannot split on newlines the way I normally would:
_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .ToArray();
Here is the sample text:
adj 1: around the middle of a scale of evaluation of physical
measures; "an orange of average size"; "intermediate
capacity"; "a plane with intermediate range"; "medium
bombers" [syn: {average}, {intermediate}]
2: (of meat) cooked until there is just a little pink meat
inside
n 1: a means or instrumentality for storing or communicating
information
2: the surrounding environment; "fish require an aqueous
medium"
3: an intervening substance through which signals can travel as
a means for communication
4: (bacteriology) a nutrient substance (solid or liquid) that
is used to cultivate micro-organisms [syn: {culture medium}]
5: an intervening substance through which something is
achieved; "the dissolving medium is called a solvent"
6: a liquid with which pigment is mixed by a painter
7: (biology) a substance in which specimens are preserved or
displayed
8: a state that is intermediate between extremes; a middle
position; "a happy medium"
The format is always the same:
So in this case each new line would have to start with something like an optional 1-3 character word, followed by a 1-2 digit number, followed by a colon (:).
Can someone give me some advice on how I could do this with the split or with another method?
Update: I tried Steven's answer, but I am not quite sure how to fit it into my function. Below is my original code (commented out), followed by Steven's suggested code. There is a part missing (marked with // ...) that I am not sure how to fill in:
public parser(string text)
{
    //_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
    //            .ToArray();
    string pattern = @"(\w{1,3} )?1?\d: (?<line>[^\r\n]+)(\r?\n\s+(?<line>[^\r\n]+))*";
    foreach (Match m in Regex.Matches(text, pattern))
    {
        if (m.Success)
        {
            string entry = string.Join(Environment.NewLine,
                m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value));
            // ...
        }
    }
}
For testing purposes here is the text in a different format:
"medium\n adj 1: around the middle of a scale of evaluation of physical\n measures; \"an orange of average size\"; \"intermediate\n capacity\"; \"a plane with intermediate range\"; \"medium\n bombers\" [syn: {average}, {intermediate}]\n 2: (of meat) cooked until there is just a little pink meat\n inside\n n 1: a means or instrumentality for storing or communicating\n information\n 2: the surrounding environment; \"fish require an aqueous\n medium\"\n 3: an intervening substance through which signals can travel as\n a means for communication\n 4: (bacteriology) a nutrient substance (solid or liquid) that\n is used to cultivate micro-organisms [syn: {culture medium}]\n 5: an intervening substance through which something is\n achieved; \"the dissolving medium is called a solvent\"\n 6: a liquid with which pigment is mixed by a painter\n 7: (biology) a substance in which specimens are preserved or\n displayed\n 8: a state that is intermediate between extremes; a middle\n position; \"a happy medium\"\n 9: someone who serves as an intermediary between the living and\n the dead; \"he consulted several mediums\" [syn: {spiritualist}]\n 10: transmissions that are disseminated widely to the public\n [syn: {mass medium}]\n 11: an occupation for which you are especially well suited; \"in\n law he found his true metier\" [syn: {metier}]\n [also: {media} (pl)]\n"
Upvotes: 4
Views: 315
Reputation: 43743
Regex works nicely for this. For instance:
public parser(string text)
{
    string pattern = @"(?<line> (\w{1,3} )?1?\d: [^\r\n]+)(\r?\n(?! (\w{1,3} )?1?\d: [^\r\n]+)\s+(?<line>[^\r\n]+))*";
    var entries = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
        if (m.Success)
            entries.Add(string.Join(" ",
                m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value)));
    _text = entries.ToArray();
}
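For comparison, the rule from the question (an optional 1-3 letter word, then a 1-2 digit number, then a colon) can also be expressed with Regex.Split and a lookahead, which keeps the original Split-style code shape. This is only a rough sketch under assumptions: the class name EntrySplitter and the method SplitEntries are made up for illustration, the prefix is assumed to consist of ASCII letters, and continuation lines are assumed never to start with something that looks like "word number:".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original posts.
public static class EntrySplitter
{
    // Split on a line break only when the next line starts a new entry:
    // optional indentation, an optional 1-3 letter word, a 1-2 digit number, a colon.
    // The lookahead keeps the entry header in the following piece.
    private static readonly Regex EntryStart = new Regex(
        @"\r?\n(?=[ \t]*(?:[A-Za-z]{1,3}[ \t]+)?\d{1,2}:[ \t])");

    public static string[] SplitEntries(string text)
    {
        return EntryStart.Split(text)
            // Collapse the wrapped lines and indentation inside each entry into single spaces.
            .Select(part => Regex.Replace(part, @"\s+", " ").Trim())
            // Drop pieces that are not entries, such as the leading headword line.
            .Where(part => Regex.IsMatch(part, @"^(?:[A-Za-z]{1,3} )?\d{1,2}: "))
            .ToArray();
    }
}
```

Inside the constructor this would be used as _text = EntrySplitter.SplitEntries(text);. Note that any trailing line that does not start a new entry, such as the "[also: ...]" line in the test text, ends up appended to the last entry, so it may need separate handling.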
Upvotes: 2
Reputation: 34421
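Try this. It reads the file line by line and matches each line against a regex with named groups for the prefix, the index, and the text: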
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
namespace ConsoleApplication106
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
            string inputLine = "";
            List<Data> data = new List<Data>();
            // Anchored at the start of the trimmed line. The prefix is letters only,
            // so that "10:" is not split into prefix "1" and index 0.
            string pattern = @"^(?'prefix'[A-Za-z]*)\s*(?'index'\d+):(?'text'.*)";
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    Match match = Regex.Match(inputLine, pattern);
                    if (match.Success)
                    {
                        // This line starts a new numbered entry.
                        Data newData = new Data();
                        data.Add(newData);
                        newData.prefix = match.Groups["prefix"].Value;
                        newData.index = int.Parse(match.Groups["index"].Value);
                        newData.text = match.Groups["text"].Value.Trim();
                    }
                    else if (data.Count > 0)
                    {
                        // Continuation line: append it to the current entry rather than
                        // calling int.Parse on an empty group, which would throw.
                        data[data.Count - 1].text += " " + inputLine;
                    }
                }
            }
        }
    }
    public class Data
    {
        public string prefix { get; set; }
        public int index { get; set; }
        public string text { get; set; }
    }
}
Upvotes: 2