Rich
Rich

Reputation: 2174

How to extract phrases and then words in a string of text?

I have a search method that takes in a user-entered string, splits it at each space character and then proceeds to find matches based on the list of separated terms:

string[] terms = searchTerms.ToLower().Trim().Split( ' ' );

Now I have been given a further requirement: to be able to search for phrases via double quote delimiters a la Google. So if the search terms provided were:

"a line of" text

The search would match occurrences of "a line of" and "text" rather than the four separate terms [the open and closing double quotes would also need to be removed before searching].

How can I achieve this in C#? I would assume regular expressions would be the way to go, but haven't dabbled in them much so don't know if they are the best solution.

If you need any more info, please ask. Thanks in advance for the help.

Upvotes: 2

Views: 3177

Answers (5)

GorvGoyl
GorvGoyl

Reputation: 49680

Try this, It'll return an array for text. ex: { "a line of" text "notepad" }:

string textToSearch = "\"a line of\" text \" notepad\"";

MatchCollection allPhrases = Regex.Matches(textToSearch, "(?<=\").*?(?=\")");

var RegArray = allPhrases.Cast<Match>().ToArray();

output: {"a line of","text"," notepad" }

Upvotes: 0

Drew Noakes
Drew Noakes

Reputation: 311325

Here's a regex pattern that would return matches in groups named 'term':

("(?<term>[^"]+)"\s*|(?<term>[^ ]+)\s*)+

So for the input:

"a line" of text

The output items identified by the 'term' group would be:

a line
of
text

Upvotes: 2

Rob Cowell
Rob Cowell

Reputation: 1628

The Knuth-Morris-Pratt (KMP algorithm)is recognised as the fastest algorithm for finding substrings in strings (well, technically not strings but byte-arrays).

using System.Collections.Generic;

namespace KMPSearch
{
    public class KMPSearch
    {
        public static int NORESULT = -1;

        private string _needle;
        private string _haystack;
        private int[] _jumpTable;

        public KMPSearch(string haystack, string needle)
        {
            Haystack = haystack;
            Needle = needle;
        }

        public void ComputeJumpTable()
        {
            //Fix if we are looking for just one character...
            if (Needle.Length == 1)
            {
                JumpTable = new int[1] { -1 };
            }
            else
            {
                int needleLength = Needle.Length;
                int i = 2;
                int k = 0;

                JumpTable = new int[needleLength];
                JumpTable[0] = -1;
                JumpTable[1] = 0;

                while (i <= needleLength)
                {
                    if (i == needleLength)
                    {
                        JumpTable[needleLength - 1] = k;
                    }
                    else if (Needle[k] == Needle[i])
                    {
                        k++;
                        JumpTable[i] = k;
                    }
                    else if (k > 0)
                    {
                        JumpTable[i - 1] = k;
                        k = 0;
                    }

                    i++;
                }
            }
        }

        public int[] MatchAll()
        {
            List<int> matches = new List<int>();
            int offset = 0;
            int needleLength = Needle.Length;
            int m = Match(offset);

            while (m != NORESULT)
            {
                matches.Add(m);
                offset = m + needleLength;
                m = Match(offset);
            }

            return matches.ToArray();
        }

        public int Match()
        {
            return Match(0);
        }

        public int Match(int offset)
        {
            ComputeJumpTable();

            int haystackLength = Haystack.Length;
            int needleLength = Needle.Length;

            if ((offset >= haystackLength) || (needleLength > ( haystackLength - offset))) 
                return NORESULT;

            int haystackIndex = offset;
            int needleIndex = 0;

            while (haystackIndex < haystackLength)
            {
                if (needleIndex >= needleLength)
                    return haystackIndex;

                if (haystackIndex + needleIndex >= haystackLength)
                    return NORESULT;

                if (Haystack[haystackIndex + needleIndex] == Needle[needleIndex])
                {
                    needleIndex++;
                } 
                    else
                {
                    //Naive solution
                    haystackIndex += needleIndex;

                    //Go back
                    if (needleIndex > 1)
                    {
                        //Index of the last matching character is needleIndex - 1!
                        haystackIndex -= JumpTable[needleIndex - 1];
                        needleIndex = JumpTable[needleIndex - 1];
                    }
                    else
                        haystackIndex -= JumpTable[needleIndex];


                }
            }

            return NORESULT;
        }

        public string Needle
        {
            get { return _needle; }
            set { _needle = value; }
        }

        public string Haystack
        {
            get { return _haystack; }
            set { _haystack = value; }
        }

        public int[] JumpTable
        {
            get { return _jumpTable; }
            set { _jumpTable = value; }
        }
    }
}

Usage :-

using System;
using System.Collections.Generic;
using System.Text;
using System.Reflection;
namespace KMPSearch
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: " + Environment.GetCommandLineArgs()[0] + " haystack needle");
            }
            else
            {
                KMPSearch search = new KMPSearch(args[0], args[1]);
                int[] matches = search.MatchAll();
                foreach (int i in matches)
                    Console.WriteLine("Match found at position " + i+1);
            }
        }

    }
}

Upvotes: 0

xoxo
xoxo

Reputation: 577

Use Regexs....

string textToSearchIn = ""a line of" text";
string result = Regex.Match(textToSearchIn, "(?<=").*?(?=")").Value;

or if more then one, put this into a match collection...

MatchCollection allPhrases = Regex.Matches(textToSearchIn, "(?<=").*?(?=")");

Upvotes: 1

Robban
Robban

Reputation: 6802

Regular expressions would definitely be the way to go...

You should check this MSDN link out for some info on the Regex class: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx

and here is an excellent link to learn some regular expression syntax: http://www.radsoftware.com.au/articles/regexlearnsyntax.aspx

Then to add some code examples, you could be doing it something along these lines:

string searchString = "a line of";

Match m = Regex.Match(textToSearch, searchString);

or if you just want to find out if the string contains a match or not:

bool success = Regex.Match(textToSearch, searchString).Success;

Upvotes: 1

Related Questions