Reputation: 137544
How to split text into words?
Example text:
'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'
The words in that line are:
Upvotes: 32
Views: 59172
Reputation: 77
If you want to use the "for cycle" to check each char and save all punctuation in the input string I've create this class. The method GetSplitSentence() return a list of SentenceSplitResult. In this list there are saved all the words and all the punctuation & numbers. Each punctuation or numbers saved is an item in the list. The sentenceSplitResult.isAWord is used to check if is a word or not. [Sorry for my English]
public class SentenceSplitResult
{
public string word;
public bool isAWord;
}
public class StringsHelper
{
private readonly List<SentenceSplitResult> outputList = new List<SentenceSplitResult>();
private readonly string input;
public StringsHelper(string input)
{
this.input = input;
}
public List<SentenceSplitResult> GetSplitSentence()
{
StringBuilder sb = new StringBuilder();
try
{
if (String.IsNullOrEmpty(input)) {
Logger.Log(new ArgumentNullException(), "GetSplitSentence - input is null or empy");
return outputList;
}
bool isAletter = IsAValidLetter(input[0]);
// Each char i checked if is a part of a word.
// If is YES > I can store the char for later
// IF is NO > I Save the word (if exist) and then save the punctuation
foreach (var _char in input)
{
isAletter = IsAValidLetter(_char);
if (isAletter == true)
{
sb.Append(_char);
}
else
{
SaveWord(sb.ToString());
sb.Clear();
SaveANotWord(_char);
}
}
SaveWord(sb.ToString());
}
catch (Exception ex)
{
Logger.Log(ex);
}
return outputList;
}
private static bool IsAValidLetter(char _char)
{
if ((Char.IsPunctuation(_char) == true) || (_char == ' ') || (Char.IsNumber(_char) == true))
{
return false;
}
return true;
}
private void SaveWord(string word)
{
if (String.IsNullOrEmpty(word) == false)
{
outputList.Add(new SentenceSplitResult()
{
isAWord = true,
word = word
});
}
}
private void SaveANotWord(char _char)
{
outputList.Add(new SentenceSplitResult()
{
isAWord = false,
word = _char.ToString()
});
}
Upvotes: 2
Reputation: 5961
First, Remove all special characeters:
var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);
// This regex doesn't support apostrophe so the extension method is better
Then split it:
var split = fixedInput.Split(' ');
For a simpler C# solution for removing special characters (that you can easily change), add this extension method (I added a support for an apostrophe):
public static string RemoveSpecialCharacters(this string str) {
var sb = new StringBuilder();
foreach (char c in str) {
if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
sb.Append(c);
}
}
return sb.ToString();
}
Then use it like so:
var words = input.RemoveSpecialCharacters().Split(' ');
You'll be surprised to know that this extension method is very efficient (surely much more efficient then the Regex) so I'll suggest you use it ;)
Update
I agree that this is an English only approach but to make it Unicode compatible all you have to do is replace:
(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
With:
char.IsLetter(c)
Which supports Unicode, .Net Also offers you char.IsSymbol
and char.IsLetterOrDigit
for the variety of cases
Upvotes: 26
Reputation: 27926
Just to add a variation on @Adam Fridental's answer which is very good, you could try this Regex:
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var matches = Regex.Matches(text, @"\w+[^\s]*\w+|\w");
foreach (Match match in matches) {
var word = match.Value;
}
I believe this is the shortest RegEx that will get all the words
\w+[^\s]*\w+|\w
Upvotes: 8
Reputation: 137544
Split text on whitespace, then trim punctuation.
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));
Agrees exactly with example.
Upvotes: 60
Reputation: 485
This is one of solution, i dont use any helper class or method.
public static List<string> ExtractChars(string inputString) {
var result = new List<string>();
int startIndex = -1;
for (int i = 0; i < inputString.Length; i++) {
var character = inputString[i];
if ((character >= 'a' && character <= 'z') ||
(character >= 'A' && character <= 'Z')) {
if (startIndex == -1) {
startIndex = i;
}
if (i == inputString.Length - 1) {
result.Add(GetString(inputString, startIndex, i));
}
continue;
}
if (startIndex != -1) {
result.Add(GetString(inputString, startIndex, i - 1));
startIndex = -1;
}
}
return result;
}
public static string GetString(string inputString, int startIndex, int endIndex) {
string result = "";
for (int i = startIndex; i <= endIndex; i++) {
result += inputString[i];
}
return result;
}
Upvotes: 1
Reputation: 32694
If you don't want to use a Regex object, you could do something like...
string mystring="Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.";
List<string> words=mystring.Replace(",","").Replace(":","").Replace(".","").Split(" ").ToList();
You'll still have to handle the trailing apostrophe at the end of "that,'"
Upvotes: 1
Reputation: 69362
You could try using a regex to remove the apostrophes that aren't surrounded by letters (i.e. single quotes) and then using the Char
static methods to strip all the other characters. By calling the regex first you can keep the contraction apostrophes (e.g. can't
) but remove the single quotes like in 'Oh
.
string myText = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
Regex reg = new Regex("\b[\"']\b");
myText = reg.Replace(myText, "");
string[] listOfWords = RemoveCharacters(myText);
public string[] RemoveCharacters(string input)
{
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
if (Char.IsLetter(c) || Char.IsWhiteSpace(c) || c == '\'')
sb.Append(c);
}
return sb.ToString().Split(' ');
}
Upvotes: 0