chepe263

Reputation: 2812

Regex: Split X length words

I'm new to regular expressions. I have a very large text, and in my application I need to keep only the words that are exactly 4 characters long and delete everything else. The text is in Spanish. So far I can select 4-character words, but I still need to delete the rest.

This is my regular expression

\s(\w{3,3}[a-zA-ZáéíóúäëïöüñÑ])\s

How can I get all the 4-letter words in ASP.NET (VB)?

Upvotes: 0

Views: 598

Answers (3)

Ωmega

Reputation: 43683

/(?:\A|(?<=\P{L}))(\p{L}{4})(?:(?=\P{L})|\z)/g

Explanation:

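The /g switch makes the search global, i.e. it finds every match rather than only the first (in .NET there is no /g; Regex.Matches returns all matches)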

\A is start of the string (not start of line)

\p{L} matches a single code point in the category letter

\P{L} matches a single code point not in the category letter

{n} specifies an exact number of repetitions [n is a number]

\z is end of string (not end of line)

| is the alternation (logical OR) operator

(?<=) is lookbehind

(?=) is lookahead

(?:) is a non-capturing group (it creates no backreference)

() is a capturing group (it creates a backreference)
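A minimal VB.NET sketch of how this pattern might be applied, assuming the .NET Regex class and precomposed accented characters in the input. The sample text and the module name are made up for illustration; the /.../g delimiters are dropped because .NET patterns do not use them.

```vb
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module FourLetterWords
    Sub Main()
        ' Hypothetical sample text; replace it with the real input.
        Dim input As String = "La casa azul tiene una niña pequeña"

        ' Same pattern as above, without the /.../g delimiters.
        Dim pattern As String = "(?:\A|(?<=\P{L}))(\p{L}{4})(?:(?=\P{L})|\z)"

        ' Regex.Matches returns every match, which covers what /g does elsewhere.
        ' Group 1 holds the 4-letter word itself.
        Dim words = Regex.Matches(input, pattern).
            Cast(Of Match)().
            Select(Function(m) m.Groups(1).Value)

        ' Joining the captured words drops everything else.
        Console.WriteLine(String.Join(" ", words))   ' casa azul niña
    End Sub
End Module
```

Collecting the matches and joining them avoids having to write a second pattern that matches "everything that is not a 4-letter word".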

Upvotes: 3

tweak2

Reputation: 656

This uses an explicit character class like the one in the question, because in many regex flavors \w does NOT match accented Spanish letters. (.NET's \w is Unicode-aware by default, but it also matches digits and underscores.)

You can use the following for a match. It matches the inverse, i.e. everything that is NOT a 4-character word, so you can replace each match with " " and be left with only the 4-character words:

/(^|(?<=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W)))(.*?)((?=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W))|$)/gis

Approximated code in VB (not tested):

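  ' At the top of the file:
  Imports System.Text.RegularExpressions

  Dim input As String = "This is your text"
  ' .NET patterns take no /.../ delimiters. Replace is already global (g),
  ' and the i and s flags become RegexOptions.
  Dim pattern As String = "(^|(?<=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W)))(.*?)((?=(?<=\W)[a-zA-ZáéíóúäëïöüñÑ]{4,4}(?=\W))|$)"
  Dim replacement As String = " "
  Dim rgx As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
  Dim result As String = rgx.Replace(input, replacement)

  Console.WriteLine("Original String: {0}", input)
  Console.WriteLine("Replacement String: {0}", result)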

You can see the result of the regex in action here:

http://regexr.com?30n29

Upvotes: 2

Jeff Lamb

Reputation: 5865

/[^a-zA-ZáéíóúäëïöüñÑ][a-zA-ZáéíóúäëïöüñÑ]{4}[^a-zA-ZáéíóúäëïöüñÑ]/g

Translated: a non-letter, followed by 4 letters, followed by a non-letter. The trailing g flag makes the match global, so it can match more than once.

Check out this link to find out more info on looping over your matches: http://osherove.com/blog/2003/5/12/practical-parsing-using-groups-in-regular-expressions.html
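A rough VB.NET sketch of looping over the matches, assuming the .NET Regex class; the sample sentence and the names are invented for illustration. It also compares the pattern above with a lookaround variant (not part of this answer), because a pattern that consumes the surrounding non-letters cannot reuse a single space for two neighbouring words and needs a non-letter on both sides of every word.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ConsumingVersusLookaround
    Sub Main()
        Dim letters As String = "[a-zA-ZáéíóúäëïöüñÑ]"
        Dim nonLetters As String = "[^a-zA-ZáéíóúäëïöüñÑ]"

        ' Hypothetical sample text.
        Dim input As String = "una casa azul y niña"

        ' The pattern from this answer: the non-letters are part of the match.
        Dim consuming As String = nonLetters & letters & "{4}" & nonLetters

        ' Assumed alternative: lookarounds check the neighbours without consuming them.
        Dim lookaround As String = "(?<!" & letters & ")" & letters & "{4}(?!" & letters & ")"

        ' Finds only " casa ": the space after it is consumed, so "azul" has no
        ' leading non-letter left, and "niña" has no trailing one.
        For Each m As Match In Regex.Matches(input, consuming)
            Console.WriteLine("consuming:  [{0}]", m.Value)
        Next

        ' Finds casa, azul and niña.
        For Each m As Match In Regex.Matches(input, lookaround)
            Console.WriteLine("lookaround: [{0}]", m.Value)
        Next
    End Sub
End Module
```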

Upvotes: -2
