C# Regex filter problems

Question

I posted an earlier question about the same kind of Regex problem. It has given me headaches: I have looked through loads of documentation on how to use regex, but I still cannot put my finger on it. I don't want to waste another 6 hours trying to filter what are (I think) simple expressions.

So basically, what I want to do is filter all file types that end with an HTML extension (the '*' stars come from a WinForms TabControl and signify that the file has been modified). I also need the match to ignore case:

.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*

Also filtering some CSS files:

.css
.css*

And some SQL Files:

.sql, .ddl, .dml
.sql*, .ddl*, .dml*

My previous question got an answer to filtering Python files:

.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$

But when I tried to learn from the expression above, I always ended up with a pattern that opens only .xhtml; the rest are not treated as valid.

For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But it only accepts .xhtml, regardless of case; .html, .htm and the rest do not match. I would really appreciate an explanation of each expression you provide (so I don't have to ask the same question ever again).

Thank you.

Chrᴉz remembers Monica · Accepted Answer

For such cases you can start with a simple regex and simplify it step by step into a good expression.

In C#, with IgnoreCase, this would basically be

Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
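As a minimal sketch of how that constructor is used: the pattern `\.html` below is only a placeholder I picked for illustration, and the file names are made up. Writing the pattern as a verbatim string (`@"..."`) avoids having to double every backslash.

```csharp
using System;
using System.Text.RegularExpressions;

// Placeholder pattern (an assumption for illustration), not the final one.
var myRegex = new Regex(@"\.html", RegexOptions.IgnoreCase);

// IsMatch returns true if the pattern occurs anywhere in the input.
bool lower = myRegex.IsMatch("index.html");
bool upper = myRegex.IsMatch("INDEX.HTML"); // matched thanks to IgnoreCase
bool other = myRegex.IsMatch("style.css");

Console.WriteLine($"{lower} {upper} {other}"); // True True False
```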

Now the pattern. The easiest approach is to simply concatenate all valid results with OR (|), escaping special characters where needed:

\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*

With .html* I assume you mean .html followed by anything, which in regex is written as .* (any character, zero or more times). An unescaped * directly after the l would instead mean "zero or more l".

\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*

Then you can take all repeating parts and group them together. Every extension starts with a dot, and because .* may also match nothing, ending.* already covers plain ending, so the alternatives without .* are redundant:

\.(html|htm|shtml|shtm|xhtml).*
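To check that the grouping did not change behaviour, one can run the long alternation and the grouped pattern over the same inputs. This is only a sketch; the sample file names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var longForm = new Regex(
    @"\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*",
    RegexOptions.IgnoreCase);
var grouped = new Regex(@"\.(html|htm|shtml|shtm|xhtml).*", RegexOptions.IgnoreCase);

// Hypothetical sample names; a trailing '*' marks a modified tab.
string[] samples = { "a.html", "a.HTM", "a.shtml*", "a.xhtml", "a.css", "a.py" };

bool allAgree = true;
foreach (var s in samples)
{
    bool a = longForm.IsMatch(s);
    bool b = grouped.IsMatch(s);
    if (a != b) allAgree = false;
    Console.WriteLine($"{s}: long={a}, grouped={b}");
}

Console.WriteLine($"Both patterns agree: {allAgree}");
```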

Then, I see htm in every alternative, so I try to extract it and make the characters that may appear before it (s or x) and after it (l) optional (? means 0 or 1 occurrence):

\.(s|x)?(htm)l?.*
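A short sketch of how this pattern behaves. Note that the trailing .* accepts anything after the extension. If the star is meant as a literal asterisk and the extension must be at the end of the name, a stricter variant modelled on the Python expression from the question (`\*?$`) could be used instead. That variant is my assumption, not part of the derivation above, and the file names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the derivation above.
var loose = new Regex(@"\.(s|x)?(htm)l?.*", RegexOptions.IgnoreCase);

// Assumed stricter variant: optional literal '*' and then end of input.
var strict = new Regex(@"\.[sx]?html?\*?$", RegexOptions.IgnoreCase);

// Hypothetical file names for illustration.
string[] samples = { "page.html", "page.XHTML*", "page.shtm", "page.html.bak", "page.css" };

foreach (var s in samples)
{
    Console.WriteLine($"{s}: loose={loose.IsMatch(s)}, strict={strict.IsMatch(s)}");
}
// Only "page.html.bak" differs: loose=True, strict=False.
```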

And I always check that it still works in regexstorm, a regex tester for .NET.

In the same way, you can derive regular expressions for the other two groups and concatenate them all at the end.
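Applying the same steps to the CSS and SQL extensions from the question might give something like the following. These patterns are my own sketch of that process, not expressions given in the answer, and the combined pattern is just one possible way to join them.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed results of the same derivation for the other two groups.
var css = new Regex(@"\.css.*", RegexOptions.IgnoreCase);
var sql = new Regex(@"\.(sql|ddl|dml).*", RegexOptions.IgnoreCase);

// One possible concatenation of all three groups behind a shared dot.
var combined = new Regex(@"\.((s|x)?html?|css|sql|ddl|dml).*", RegexOptions.IgnoreCase);

bool cssModified = css.IsMatch("site.css*");       // modified CSS tab
bool ddlUpper = sql.IsMatch("schema.DDL");         // case-insensitive SQL match
bool xhtml = combined.IsMatch("index.xhtml");      // HTML group inside combined
bool python = combined.IsMatch("main.py");         // not in any group

Console.WriteLine($"{cssModified} {ddlUpper} {xhtml} {python}"); // True True True False
```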
