Joan Venge

Reputation: 330992

Is there a way to get a string up until a year value?

Basically I have some filenames where there is a year in the middle. I am only interested in getting any letter or number up until the year value, but only letters and numbers, not commas, dots, underscores, etc. Is it possible? Maybe with Regex?

For instance:

"A-Good-Life-2010-For-Archive"
"Any.Chararacter_Can+Come.Before!2011-RedundantInfo"
"WhatyouseeIsWhatUget.2012-Not"
"400-Gestures.In1.2000-Communication"

where I want:

"AGoodLife"
"AnyChararacterCanComeBefore"
"WhatyouseeIsWhatUget"
"400GesturesIn1"

By numbers I mean any number that doesn't look like a year, i.e. 1 digit, 2 digits, 3 digits, 5 digits, and so on. I only want to recognize 4 digit numbers as years.

Upvotes: 1

Views: 127

Answers (5)

mathematical.coffee

Reputation: 56915

You'll have to do this in two parts -- first to remove the symbols you don't want, and second to grab everything up to the year (or vice versa).

To grab everything up to the year, you can use:

Match match = Regex.Match(movieTitle,@"(.*)(?<!\d)(?:19|20)[0-9]{2}(?!\d)");
// if match.Success, the result is in match.Groups[1].Value

I've made the year regex match only years in the 1900s or 2000s, so that a four-digit number that isn't a year is not mistaken for one (e.g. "Ali-Baba-And-the-1234-Thieves.2011").

However, if your movie title involves a year, then this won't really work ("2001:-Space-Odyssey(1968)").

To then remove all the non-alphanumeric characters, you can replace "[^a-zA-Z0-9]" with "". (I've allowed digits because a movie might have legitimate numbers in the title.)

UPDATED from comments below:

  • If you search from the end to find the year you might do better, i.e. treat the last-occurring year candidate as the year. Hence, I've changed a .*? to .* in the regex so that the title is as greedy as possible and only the last year candidate is used as the year.
  • Added a (?!\d) to the end of the year regex and a (?<!\d) to the start so that it doesn't match the "2001" inside "My-title-120012-fdsa" and return "My-title-1" as the title. (I didn't use the word boundary \b because the title might be "A-Good-Life2010", which has no boundary before the year.)
  • Changed the string to a verbatim string (@"...") so I don't need to worry about C# interpreting the backslashes in the regex.
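
Putting the two steps together, a minimal sketch might look like the following. The class name TitleExtractor, the method name ExtractTitle, and the choice to return an empty string when no year is found are assumptions made for illustration; they are not part of the answer above.

```csharp
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Greedy (.*) makes the LAST year candidate win; the lookarounds
    // reject four digits that are part of a longer digit run.
    private static readonly Regex YearPattern =
        new Regex(@"(.*)(?<!\d)(?:19|20)[0-9]{2}(?!\d)");

    // Returns the letters and digits that precede the year.
    // Assumption: an empty string is returned when no year candidate exists.
    public static string ExtractTitle(string fileName)
    {
        Match match = YearPattern.Match(fileName);
        if (!match.Success)
            return string.Empty;

        // Second step: drop everything that is not an ASCII letter or digit.
        return Regex.Replace(match.Groups[1].Value, "[^a-zA-Z0-9]", "");
    }
}
```

Calling TitleExtractor.ExtractTitle("400-Gestures.In1.2000-Communication") should give "400GesturesIn1", and a name such as "My-title-120012-fdsa" yields the empty string because no standalone year is present.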

Upvotes: 1

Mawg

Reputation: 40140

I suppose you want a fancy regular expression?

Why not a simple for loop?

digitCount = 0;
for i = 0 to strlen(filename) - 1
{
  if isdigit(filename[i])
  {
     digitCount++;
     if digitCount == 4
     {
        // the year occupies positions i-3..i, so keep positions 0..i-4
        thePartOfTheFileNameThatYouWant = strcpy(filename, 0, i-4)
        break;
     }
  }
  else digitCount = 0;
}

// Sorry, I don't know C-sharp
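
For reference, one possible C# rendering of the loop idea is sketched below. It differs from the pseudocode in two ways that are assumptions on my part: it waits for the digit run to end so that only runs of exactly four digits count as a year (the rule stated in the question), and it filters out non-alphanumeric characters while copying. The names LoopExtractor and BeforeYear are hypothetical.

```csharp
using System.Text;

public static class LoopExtractor
{
    public static string BeforeYear(string fileName)
    {
        int runLength = 0; // length of the current run of consecutive digits

        // Go one position past the end so a digit run that finishes the
        // string is still checked.
        for (int i = 0; i <= fileName.Length; i++)
        {
            bool isDigit = i < fileName.Length
                && fileName[i] >= '0' && fileName[i] <= '9';
            if (isDigit)
            {
                runLength++;
                continue;
            }

            if (runLength == 4)
            {
                // The year occupies [i - 4, i); keep letters and digits before it.
                var result = new StringBuilder();
                for (int j = 0; j < i - 4; j++)
                {
                    if (char.IsLetterOrDigit(fileName[j]))
                        result.Append(fileName[j]);
                }
                return result.ToString();
            }

            runLength = 0;
        }

        // Assumption: no four-digit run means there is nothing to return.
        return string.Empty;
    }
}
```

Unlike the greedy regex approach, this version stops at the first four-digit run rather than the last one.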

Upvotes: 0

millimoose

Reputation: 39950

You can use Regex.Split() to make the code a little terser (and possibly faster due to the simpler regex):

var str = "400-Gestures.In1.2000-Communication";

var re = new Regex(@"(^|\D)\d{4}(\D|$)");
var start = re.Split(str)[0];

// remove nonalphanumerics
var result = new string(start.Where(c=>Char.IsLetterOrDigit(c)).ToArray());
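
One caveat: (^|\D) consumes the character in front of the year, so a name like "A-Good-Life2010" would lose its final "e". A variant using lookarounds, which assert without consuming, might avoid that. This is only a sketch; the names SplitExtractor and BeforeYear are hypothetical, and when no year is present the whole cleaned name comes back, which may or may not be what you want.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitExtractor
{
    // Four digits with no digit on either side; the lookarounds do not
    // consume the neighbouring characters.
    private static readonly Regex Year = new Regex(@"(?<!\d)\d{4}(?!\d)");

    public static string BeforeYear(string fileName)
    {
        // Element 0 is everything before the first four-digit run
        // (or the whole string when there is no such run).
        string start = Year.Split(fileName)[0];

        // Keep only letters and digits.
        return new string(start.Where(c => char.IsLetterOrDigit(c)).ToArray());
    }
}
```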

Upvotes: 1

sblom

Reputation: 27343

using System.Text.RegularExpressions;


string GoodParts(string input) {
  Regex re = new Regex(@"^(.*\D)\d{4}(\D|$)");
  var match = re.Match(input);
  string result = Regex.Replace(match.Groups[1].Value, "[^0-9a-zA-Z]+", "");
  return result;
}

Upvotes: 1

Ravi Gadag

Reputation: 15861

You can try this:

/\b\d{4}\b/

\d{4}\b matches four digits followed by a word boundary. The leading \b requires a word boundary before the digits as well; depending on the input data, you may or may not want to keep it.
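
In C# the pattern is written without the surrounding slashes. A minimal sketch of using it follows; the names BoundaryExtractor and BeforeYear are hypothetical, and returning the whole cleaned name when no year is found is an assumption.

```csharp
using System.Text.RegularExpressions;

public static class BoundaryExtractor
{
    private static readonly Regex Year = new Regex(@"\b\d{4}\b");

    public static string BeforeYear(string fileName)
    {
        Match m = Year.Match(fileName);

        // Assumption: when no year is found, the whole name is cleaned and returned.
        string head = m.Success ? fileName.Substring(0, m.Index) : fileName;

        // Keep only ASCII letters and digits.
        return Regex.Replace(head, "[^0-9a-zA-Z]", "");
    }
}
```

Note that letters, digits, and the underscore all count as word characters, so \b finds no boundary in "Life2010" or "Name_2011" and the year is not detected in those names.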

Upvotes: 1
