Reputation: 1900

Regex on the innerHTML of a table to find special characters

I'm having a hard time getting this to work.

I have this HTML code:

<table border='1'><tr><th></th><th>Fact Questions Report Type Count</th></tr><tr>
<td class=' sorting_1'>0 - 18</td><td>78</td></tr><tr><td class=' sorting_1'>19-64</td>
<td>78</td></tr><tr><td class=' sorting_1'>65+</td><td>78</td></tr><tr>
<td class=' sorting_1'>אין גיל</td><td>78</td></tr><tr><td class=' sorting_1'>נפטר</td>
<td>78</td></tr><tr><td class=' sorting_1'>Unknown</td><td>78</td></tr></table>

As you can see, it contains special characters that I want to catch, like these:

אין גיל , נפטר

I thought of writing a regex that excludes all word characters (\W), all digits (\D), and these characters: ->=|'

But I can't get it to work.

The perfect solution would return two items containing the special characters: אין גיל , נפטר

P.S.: There could be other special characters.

I would love to see an example of this in RegexPal (online editor).

Thanks!

Upvotes: 0

Answers (3)

Andrew Cheong

Reputation: 30283

If you are trying to catch characters in the Hebrew language specifically, you can try

[\u0590-\u05FF\s]+

assuming spaces are okay, or, if your regex engine supports Unicode script properties,

[\p{Hebrew}\s]+
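As a rough illustration of the Hebrew-range approach, here is a minimal Python sketch run against a shortened copy of the markup from the question. One caveat: because \s sits inside the class, [\u0590-\u05FF\s]+ also matches runs that are only whitespace, so this sketch instead requires a Hebrew character at both ends of each match. The name HEBREW_RUN is just an illustrative choice, and Python is an assumption here; the same character range should carry over to other engines that understand \uXXXX escapes.

```python
import re

# Shortened copy of the markup from the question (only a few rows kept).
html = (
    "<table border='1'><tr><td class=' sorting_1'>65+</td><td>78</td></tr>"
    "<tr><td class=' sorting_1'>אין גיל</td><td>78</td></tr>"
    "<tr><td class=' sorting_1'>נפטר</td><td>78</td></tr>"
    "<tr><td class=' sorting_1'>Unknown</td><td>78</td></tr></table>"
)

# One or more characters from the Hebrew block, optionally followed by
# more Hebrew words separated by whitespace. Anchoring both ends on a
# Hebrew character avoids matching the plain spaces in the markup.
HEBREW_RUN = re.compile(r"[\u0590-\u05FF]+(?:\s+[\u0590-\u05FF]+)*")

hebrew_matches = HEBREW_RUN.findall(html)
print(hebrew_matches)
```

For the sample markup this should print a list with the two Hebrew cell texts and nothing else.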

If you're actually trying to catch any non-English but printable characters, it's hard to help without seeing what you've tried. \W is a subset of \D (every non-word character is also a non-digit), so \W+ alone should be enough. Or, if I understand correctly that you also want to exclude ->=|', use [^\w>=|'-]+ (the dash must come last here, or first, immediately after the ^, so that it is read as a literal dash rather than a range).
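To see how the more general idea behaves on raw innerHTML, here is a hedged Python sketch. It assumes that "special" means "outside the ASCII range", which is only one reading of the question. It first runs a negated class like the one above (with the apostrophe excluded as well) and then a narrower non-ASCII pattern. Note that Python's \w is Unicode-aware by default, so re.ASCII is needed for Hebrew letters to count as non-word characters. The names broad and NON_ASCII_RUN are illustrative only.

```python
import re

# Two rows taken from the markup in the question.
html = (
    "<tr><td class=' sorting_1'>65+</td><td>78</td></tr>"
    "<tr><td class=' sorting_1'>אין גיל</td><td>78</td></tr>"
    "<tr><td class=' sorting_1'>נפטר</td><td>78</td></tr>"
)

# Negated class: anything that is not an ASCII word character and not
# one of > = | ' -. On raw markup this also picks up "<", "/", "+"
# and spaces, so the Hebrew text comes back glued to tag punctuation.
broad = re.findall(r"[^\w>=|'-]+", html, flags=re.ASCII)

# Narrower alternative (an assumption, not part of the answer above):
# runs of non-ASCII characters, optionally joined by whitespace.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+(?:\s+[^\x00-\x7F]+)*")
non_ascii_matches = NON_ASCII_RUN.findall(html)

print(broad)
print(non_ascii_matches)
```

The first list shows why the negated class alone is not enough when the input is markup rather than plain text; the second keeps only the non-ASCII cell contents.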

Upvotes: 2