Lothre1

Reputation: 3853

Extracting Titles from strings with RegEx

I need to extract program titles from short strings whose structure can't be predicted. There are some recurring patterns, as you can see below, and each string must be checked against those patterns so that I can extract the title properly.

I've bought Mastering Regular Expressions, but the time I have to accomplish this doesn't allow me to study the book and get the necessary introduction to this (interesting but particular) topic.

Perhaps someone experienced in this area could help me understand how to accomplish this?

Some random Name 2 - Ep.1
=> Some random Name 2

Some random Name - Ep.1
=> Some random Name

Boff another 2 name! - Ep. 228
=> Boff another 2 name!

Another one & the rest - T1 Ep. 2
=> Another one & the rest

T5 - Ep. 2 Another Name
=> Another Name

T3 - Ep. 3 - One More with an Hyfen
=> One More with an Hyfen

Another one this time with a Date - 02/12/2012
=> Another one this time with a Date

10 Aug 2012 - Some Other 2 - Ep. 2
=> Some Other 2

Ep. 93 -  Some program name
=> Some program name

Someother random name - Epis. 1 e 2
=> Someother random name

The Last one with something inside parenthesis (V.O.)
=> The Last one with something inside parenthesis

As you can see, the titles I want to extract may contain digits, special characters such as &, and letters a-z and A-Z (I guess that's all).

The complex part is that the title may be followed by one or more spaces and then a hyphen, and there may be zero or more spaces before Ep. (I can't explain it well; it's just complex.)

Upvotes: 1

Views: 650

Answers (2)

mortb

Reputation: 9859

This program will handle your cases. The main principle is that it removes certain sequences if they are present at the beginning or the end of the string. You'll have to maintain the list of regular expressions if the format of the strings you want to remove changes, and reorder them as needed.

    using System;
    using System.Text.RegularExpressions;

    public class MyClass
    {
        // Sample inputs from the question.
        static string[] strs =
        {
            "Some random Name 2 - Ep.1",
            "Some random Name - Ep.1",
            "Boff another 2 name! - Ep. 228",
            "Another one & the rest - T1 Ep. 2",
            "T5 - Ep. 2 Another Name",
            "T3 - Ep. 3 - One More with an Hyfen",
            "Another one this time with a Date - 02/12/2012",
            "10 Aug 2012 - Some Other 2 - Ep. 2",
            "Ep. 93 -  Some program name",
            "Someother random name - Epis. 1 e 2",
            "The Last one with something inside parenthesis (V.O.)"
        };

        // Patterns removed from the start and the end of each string, in this order.
        // "T\d+" and "\-" appear twice so that the second occurrence removes what is
        // left over once the episode and date patterns have been stripped.
        static string[] regexes =
        {
            @"T\d+",
            @"\-",
            @"Ep(i(s(o(d(e)?)?)?)?)?\s*\.?\s*\d+(\s*e\s*\d+)*",
            @"\d{2}\/\d{2}\/\d{2,4}",
            @"\d{2}\s*[A-Z]{3}\s*\d{4}",
            @"T\d+",
            @"\-",
            // Note: this also strips the trailing "!" from "Boff another 2 name!";
            // drop this entry if the "!" should be kept.
            @"\!",
            @"\(.+\)",
        };

        public static void Main()
        {
            foreach (var str in strs)
            {
                string cleaned = str.Trim();
                foreach (var cleaner in regexes)
                {
                    // Strip the pattern at the beginning, then at the end.
                    cleaned = Regex.Replace(cleaned, "^" + cleaner, string.Empty, RegexOptions.IgnoreCase).Trim();
                    cleaned = Regex.Replace(cleaned, cleaner + "$", string.Empty, RegexOptions.IgnoreCase).Trim();
                }
                Console.WriteLine(cleaned);
            }
            Console.ReadKey();
        }
    }
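
If keeping the list in the right order becomes a burden, one possible variation (not part of the program above, just a sketch) is to repeat the whole pass until the string stops changing, so each pattern only needs to be listed once. The names `TitleCleaner`, `Patterns` and `Clean` are made up for illustration, and the "!" pattern is left out here on the assumption that it should stay in the title:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCleaner
{
    // One entry per kind of noise; each is tried at both ends of the string.
    private static readonly string[] Patterns =
    {
        @"T\d+",
        @"-",
        @"Ep(i(s(o(d(e)?)?)?)?)?\s*\.?\s*\d+(\s*e\s*\d+)*",
        @"\d{2}/\d{2}/\d{2,4}",
        @"\d{2}\s*[A-Z]{3}\s*\d{4}",
        @"\(.+\)",
    };

    // Hypothetical helper: strips the patterns repeatedly until nothing changes.
    public static string Clean(string input)
    {
        string current = input.Trim();
        string previous;
        do
        {
            previous = current;
            foreach (string pattern in Patterns)
            {
                // (?: ) keeps the anchor attached to the whole pattern.
                current = Regex.Replace(current, "^(?:" + pattern + ")", "", RegexOptions.IgnoreCase).Trim();
                current = Regex.Replace(current, "(?:" + pattern + ")$", "", RegexOptions.IgnoreCase).Trim();
            }
        } while (current != previous);
        return current;
    }
}
```

For example, `TitleCleaner.Clean("10 Aug 2012 - Some Other 2 - Ep. 2")` should give `Some Other 2`. The trade-off is that repeated passes can strip more than intended if a real title happens to start or end with something that looks like one of the patterns.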

Upvotes: 1

Nolonar

Reputation: 6122

If it's only about checking for patterns, and not actually extracting the title name, let me have a go:

With @"Ep(is)?\.?\s*\d+" you can check for strings such as "Ep1", "Ep01", "Ep.999", "Ep3", "Epis.0", "Ep 11" and similar (it also allows any amount of whitespace between Ep and the number). You may want to pass RegexOptions.IgnoreCase if you want to match "ep1" as well as "Ep1" or "EP1".
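
A minimal sketch of that check, assuming all you need is a yes/no answer. The class name `EpisodeCheck` and the method `HasEpisodeInfo` are made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class EpisodeCheck
{
    // Same pattern as above; IgnoreCase also accepts "ep1" and "EP1".
    private static readonly Regex EpisodePattern =
        new Regex(@"Ep(is)?\.?\s*\d+", RegexOptions.IgnoreCase);

    // Hypothetical helper: true if the input contains something that
    // looks like episode info anywhere in the string.
    public static bool HasEpisodeInfo(string input)
    {
        return EpisodePattern.IsMatch(input);
    }
}
```

Note that the pattern is not anchored, so with IgnoreCase a title such as "Step 2" also matches (the "ep 2" part). Putting `\b` in front of `Ep` should avoid that kind of false positive.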

If you are certain that no name will include a "-" and that this character separates the name from the episode info, you can try to split the string like this:

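    string[] splitString = inputString.Split('-');
    for (int i = 0; i < splitString.Length; i++)
    {
        // Trim() returns a new string, so the result has to be stored back;
        // it removes all leading and trailing whitespace.
        splitString[i] = splitString[i].Trim();
    }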

You'll have the name in either splitString[0] or splitString[1] and the episode-info in the other.
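
To decide which part is the name, one option is to combine the split with the episode pattern. This is only a sketch under the same assumption (no "-" inside the name, and just a name part and an episode part); `SplitTitle` and `GuessTitle` are hypothetical names:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitTitle
{
    private static readonly Regex EpisodePattern =
        new Regex(@"Ep(is)?\.?\s*\d+", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the first non-empty part that does not
    // look like episode info, or null if there is no such part.
    public static string GuessTitle(string inputString)
    {
        return inputString.Split('-')
                          .Select(part => part.Trim())
                          .FirstOrDefault(part => part.Length > 0 && !EpisodePattern.IsMatch(part));
    }
}
```

With this, `SplitTitle.GuessTitle("Ep. 93 -  Some program name")` should return `Some program name`. Inputs with more than two parts, such as "T3 - Ep. 3 - One More with an Hyfen", are not handled correctly by this sketch (it would return "T3").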

To search for dates, you can use this: @"\d{1,4}[\\/.,]\d{1,2}[\\/.,]\d{1,4}" which can detect dates with the year at the front or the back, written with 1 to 4 digits (except for the middle value, which can be 1 to 2 digits long), and separated by a backslash, a slash, a comma or a dot. The separators are written as a character class because an unescaped . in an alternation would match any character, not just a dot.
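
As a small sketch of the date check, here is a version that puts the separators in a character class so that the dot is matched literally. `DateCheck` and `HasDate` are names invented for this example:

```csharp
using System.Text.RegularExpressions;

public static class DateCheck
{
    // 1-4 digits, separator, 1-2 digits, separator, 1-4 digits.
    // Separators: backslash, slash, dot or comma.
    private static readonly Regex DatePattern =
        new Regex(@"\d{1,4}[\\/.,]\d{1,2}[\\/.,]\d{1,4}");

    // Hypothetical helper: true if the input contains a numeric date.
    public static bool HasDate(string input)
    {
        return DatePattern.IsMatch(input);
    }
}
```

This only recognizes purely numeric dates such as "02/12/2012"; a date written as "10 Aug 2012" is not detected, and something like a version number "1.2.3" would be.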

Like I mentioned before: this will not allow your program to extract the actual title, only to find out if such strings exist (those strings may still be part of the title itself)

Edit:

A way to get rid of multiple whitespaces is to use inputString = Regex.Replace(inputString, @"\s+", " "), which replaces each run of whitespace with a single space (note the @: without it, "\s" is not a valid C# escape sequence). Maybe you have underscores instead of whitespaces, such as "This_is_a_name"? In that case you might want to use inputString = Regex.Replace(inputString, "_+", " ") before removing the multiple whitespaces.
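
Put together, the two replacements might look like the sketch below. `WhitespaceCleaner` and `Normalize` are made-up names, and the final `Trim()` is an extra step added here, not something described above:

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceCleaner
{
    // Hypothetical helper: underscores become spaces first, then every run
    // of whitespace is collapsed into a single space.
    public static string Normalize(string inputString)
    {
        string result = Regex.Replace(inputString, "_+", " ");
        result = Regex.Replace(result, @"\s+", " ");
        return result.Trim(); // assumption: leading/trailing spaces are unwanted
    }
}
```

For instance, `WhitespaceCleaner.Normalize("This_is_a_name")` should return `This is a name`.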

Upvotes: 0
