Reputation: 6979
I have a string and I want to check whether it is HTML or not. I am using a regex for this but not getting the proper result.
I validated my regex and it works fine here.
var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>");
return htmlRegex.test(testString);
Here's the fiddle but the regex isn't running in there. http://jsfiddle.net/wFWtc/
On my machine the code runs, but I get false instead of true as the result. What am I missing here?
Upvotes: 149
Views: 224733
Reputation: 45
You can also try this simple solution.
window.isHTML = (content) => {
  let elem = document.createElement('p');
  elem.innerHTML = content;
  return elem.children.length > 0;
}
isHTML('hello') //false
isHTML('<p>hello</p>') //true
isHTML('<p>hello</p> world') //true
Upvotes: 0
Reputation: 29426
Here's a sloppy one-liner that I use from time to time:
var isHTML = RegExp.prototype.test.bind(/(<([^>]+)>)/i);
It will basically return true for strings containing a < followed by SOMETHING followed by >.
By SOMETHING, I mean basically anything except an empty string.
It's not great, but it's a one-liner.
Usage
isHTML('Testing'); // false
isHTML('<p>Testing</p>'); // true
isHTML('<img src="hello.jpg">'); // true
isHTML('My < weird > string'); // true (caution!!!)
isHTML('<>'); // false
isHTML('< >'); // true (caution!!!)
isHTML('2 < 5 && 5 > 3'); // true (caution!!!)
As you can see it's far from perfect, but might do the job for you in some cases.
Upvotes: 25
Reputation: 11
The best way to check is to use the function below as a utility:
const containsHTML = (str: string) => /<[a-z][\s\S]*>/i.test(str);
Upvotes: 1
Reputation: 3711
The most voted answer treats the following string as an HTML pattern when it obviously isn't:
true = (b<a || b>=a)
A better approach would be <([a-zA-Z]+)(\s*|>).*(>|\/\1>), which can be visualized here.
See also the HTML Standard for further information.
This pattern is not going to validate your HTML document but rather an HTML tag. Obviously there is still room for improvement, but the more you improve it, the sooner you end up with a huge, complex HTML validation pattern, which is something you would want to avoid.
Example:
<t>
<a >
<g/>
<tag />
<tag some='1' attributes=2 foo >...
<tag some attributes/>
<tag some attributes/>...</tagx>
Upvotes: -1
Reputation: 180
Here's a regex-less approach I used for my own project.
If you are trying to detect HTML strings among other non-HTML strings, you can run the string through an HTML parser, extract the text back out, and check whether the result differs from the original string.
An example Python implementation is as follows:
from bs4 import BeautifulSoup

def isHTML(string):
    # Parse the input; other HTML parsers such as etree can be used instead
    soup = BeautifulSoup(string, 'html.parser')
    # If the extracted text differs from the input, the input contained markup
    return soup.text != string
It worked on my sample of 2800 strings.
The pseudocode would be
define function "IS_HTML"
    input = STRING
    set a copy of STRING as STRING_1
    parse STRING using an HTML parser and set the extracted text as STRING_2
    IF STRING_1 is not equal to STRING_2
        THEN RETURN TRUE
    ELSE IF STRING_1 is equal to STRING_2
        THEN RETURN FALSE
This worked for me in my test case, and it may work for you.
Upvotes: -1
Reputation: 118
While this is an old thread, I just wanted to share the solution I wrote for my needs:
function isHtml(input) {
  return /<[a-z]+\d?(\s+[\w-]+=("[^"]*"|'[^']*'))*\s*\/?>|&#?\w+;/i.test(input);
}
It should cover most of the tricky cases I've found in this thread. Tested on this page with document.body.innerText and document.body.innerHTML.
I hope it will be useful for someone. :)
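To make the behaviour of that pattern concrete, here is a small sketch that runs it against a few strings from this thread. The sample inputs are my own choice, not the answerer's test set, and the last case shows one limitation I would expect: attributes without a value are not matched.

```javascript
// Same pattern as in the answer above, repeated so the snippet is self-contained.
function isHtml(input) {
  return /<[a-z]+\d?(\s+[\w-]+=("[^"]*"|'[^']*'))*\s*\/?>|&#?\w+;/i.test(input);
}

// Sample inputs chosen for illustration only.
console.log(isHtml('plain text'));            // false
console.log(isHtml('<p>hello</p>'));          // true
console.log(isHtml('<img src="hello.jpg">')); // true
console.log(isHtml('Tom &amp; Jerry'));       // true (character reference)
console.log(isHtml('2 < 5 && 5 > 3'));        // false
console.log(isHtml('true = (b<a || b>=a)'));  // false
console.log(isHtml('<input disabled>'));      // false (valueless attribute is not matched)
```

If valueless attributes matter for your input, the attribute part of the pattern would need to make the `=value` portion optional.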
Upvotes: 3
Reputation: 116
I needed something similar for XML strings. I'll put what I came up with here in case it might be useful to anyone.
static isXMLstring(input: string): boolean {
  const reOpenFull = new RegExp(/^<[^<>\/]+>.*/);
  const reOpen = new RegExp(/^<[^<>\/]+>/);
  const reCloseFull = new RegExp(/(^<\/[^<>\/]+>.*)|(^<[^<>\/]+\/>.*)/);
  const reClose = new RegExp(/(^<\/[^<>\/]+>)|(^<[^<>\/]+\/>)/);
  const reContentFull = new RegExp(/^[^<>\/]+.*/);
  const reContent = new RegExp(/^[^<>&%]+/); // exclude reserved characters in content
  const tagStack: string[] = [];
  const getTag = (s: string, re: RegExp): string => {
    const res = (s.match(re) as string[])[0].replaceAll(/[\/<>]/g, "");
    return res.split(" ")[0];
  };
  const check = (s: string): boolean => {
    const leave = (s: string, re: RegExp): boolean => {
      const sTrimmed = s.replace(re, "");
      if (sTrimmed.length == 0) {
        return tagStack.length == 0;
      } else {
        return check(sTrimmed);
      }
    };
    if (reOpenFull.test(s)) {
      const openTag = getTag(s, reOpen);
      tagStack.push(openTag); // opening tag
      return leave(s, reOpen);
    } else if (reCloseFull.test(s)) {
      const openTag = tagStack.pop();
      const closeTag = getTag(s, reClose);
      if (openTag != closeTag) {
        return false;
      }
      // closing tag
      return leave(s, reClose);
    } else if (reContentFull.test(s)) {
      if (tagStack.length < 1) {
        return false;
      } else {
        return leave(s, reContent); // content
      }
    } else {
      return false;
    }
  };
  return check(input);
}
Upvotes: -1
Reputation: 774
The original request does not say the solution has to be a RegExp, only that an attempt to use one was being made, so I will offer this up. It treats something as HTML if at least one child element can be parsed. Note that this will return false if the body contains only comments, CDATA, or server directives.
const isHTML = (text) => {
  try {
    const fragment = new DOMParser().parseFromString(text, "text/html");
    return fragment.body.children.length > 0;
  } catch (error) { ; }
  return false;
}
Upvotes: 1
Reputation: 3111
There is an NPM package, is-html, that attempts to solve this: https://github.com/sindresorhus/is-html
Upvotes: -1
Reputation: 39
My solution is
const element = document.querySelector('.test_element');
const setHtml = elem => {
  let getElemContent = elem.innerHTML;
  // Clean up whitespace in the element
  // If you don't want to remove whitespace, then you can skip this line
  let newHtml = getElemContent.replace(/[\n\t ]+/g, " ");
  // Regex to check for HTML
  let checkHtml = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/.test(getElemContent);
  // Check whether it is HTML or not
  if (checkHtml) {
    console.log('This is an HTML');
    console.log(newHtml.trim());
  }
  else {
    console.log('This is a TEXT');
    console.log(elem.innerText.trim());
  }
}
setHtml(element);
Upvotes: 0
Reputation: 179046
A better regex to use to check if a string is HTML is:
/^/
For example:
/^/.test('') // true
/^/.test('foo bar baz') //true
/^/.test('<p>fizz buzz</p>') //true
In fact, it's so good that it'll return true for every string passed to it, which is because every string is HTML. Seriously, even if it's poorly formatted or invalid, it's still HTML.
If what you're looking for is the presence of HTML elements, rather than simply any text content, you could use something along the lines of:
/<\/?[a-z][\s\S]*>/i.test()
It won't help you parse the HTML in any way, but it will certainly flag the string as containing HTML elements.
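As a rough illustration of what that element check does and does not catch, here is a short sketch. The helper name containsHtmlElement and the sample strings are mine, not part of the answer; note the last case, where a plain comparison expression is still flagged.

```javascript
// Hypothetical helper wrapping the element-detection regex from the answer above.
const containsHtmlElement = (text) => /<\/?[a-z][\s\S]*>/i.test(text);

console.log(containsHtmlElement(''));                 // false
console.log(containsHtmlElement('foo bar baz'));      // false
console.log(containsHtmlElement('<p>fizz buzz</p>')); // true
console.log(containsHtmlElement('foo </b> bar'));     // true (stray closing tag)
console.log(containsHtmlElement('2 < 5 && 5 > 3'));   // false ("<" is not followed by a letter)
console.log(containsHtmlElement('a<b && c>d'));       // true (false positive)
```

So the check is cheap and catches most markup, but anything of the form "<letter ... >" will pass, whether or not it is really a tag.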
Upvotes: 423
Reputation: 16121
All of the answers here are over-inclusive; they just look for < followed by >. There is no perfect way to detect if a string is HTML, but you can do better.
Below we look for end tags, which is much tighter and more accurate:
import re
re_is_html = re.compile(r"(?:</[^<]+>)|(?:<[^<]+/>)")
And here it is in action:
# Correctly identified as not HTML:
print(re_is_html.search("Hello, World"))
print(re_is_html.search("This is less than <, this is greater than >."))
print(re_is_html.search(" a < 3 && b > 3"))
print(re_is_html.search("<<Important Text>>"))
print(re_is_html.search("<a>"))
# Correctly identified as HTML
print(re_is_html.search("<a>Foo</a>"))
print(re_is_html.search("<input type='submit' value='Ok' />"))
print(re_is_html.search("<br/>"))
# We don't handle, but could with more tweaking:
print(re_is_html.search("<br>"))
print(re_is_html.search("Foo & bar"))
print(re_is_html.search("<input type='submit' value='Ok'>"))
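Since the question is about JavaScript, here is a rough port of the same end-tag idea. This is my own translation of the Python pattern, not something from the answer, and the name reIsHtml is just a placeholder.

```javascript
// Assumed JavaScript equivalent of the Python pattern above:
// match either a closing tag (</...>) or a self-closing tag (<.../>).
const reIsHtml = /<\/[^<]+>|<[^<]+\/>/;

// Not flagged as HTML:
console.log(reIsHtml.test('Hello, World'));       // false
console.log(reIsHtml.test('<<Important Text>>')); // false
console.log(reIsHtml.test('<a>'));                // false

// Flagged as HTML:
console.log(reIsHtml.test('<a>Foo</a>'));                         // true
console.log(reIsHtml.test("<input type='submit' value='Ok' />")); // true
console.log(reIsHtml.test('<br/>'));                              // true

// Same known gap as the Python version:
console.log(reIsHtml.test('<br>'));               // false
```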
Upvotes: 8
Reputation: 193261
Method #1. Here is the simple function to test if the string contains HTML data:
function isHTML(str) {
  var a = document.createElement('div');
  a.innerHTML = str;
  for (var c = a.childNodes, i = c.length; i--; ) {
    if (c[i].nodeType == 1) return true;
  }
  return false;
}
The idea is to let the browser's DOM parser decide whether the provided string looks like HTML or not. As you can see, it simply checks for an ELEMENT_NODE (nodeType of 1).
I ran a couple of tests and it looks like it works:
isHTML('<a>this is a string</a>') // true
isHTML('this is a string') // false
isHTML('this is a <b>string</b>') // true
This solution will properly detect an HTML string; however, it has the side effect that img/video/etc. tags will start downloading resources once parsed in innerHTML.
Method #2. Another method uses DOMParser and doesn't have the resource-loading side effect:
function isHTML(str) {
  var doc = new DOMParser().parseFromString(str, "text/html");
  return Array.from(doc.body.childNodes).some(node => node.nodeType === 1);
}
Notes:
1. Array.from is an ES2015 method and can be replaced with [].slice.call(doc.body.childNodes).
2. The arrow function in the some call can be replaced with a usual anonymous function.
Upvotes: 111
Reputation: 39
Using jQuery in this case, the simplest form would be:
if ($(testString).length > 0)
If $(testString).length is 1, this means that there is one HTML tag inside testString.
Upvotes: 3
Reputation: 1006
zzzzBov's answer above is good, but it does not account for stray closing tags, like for example:
/<[a-z][\s\S]*>/i.test('foo </b> bar'); // false
A version that also catches closing tags could be this:
/<[a-z/][\s\S]*>/i.test('foo </b> bar'); // true
Upvotes: 13
Reputation: 4382
With jQuery:
function isHTML(str) {
return /^<.*?>$/.test(str) && !!$(str)[0];
}
Upvotes: 4
Reputation: 39
/<\/?[^>]*>/.test(str)
This only detects whether the string contains HTML tags; it may just as well be XML.
Upvotes: 3
Reputation: 10169
A little bit of validation with:
/<(?=.*? .*?\/ ?>|br|hr|input|!--|wbr)[a-z]+.*?>|<([a-z]+).*?<\/\1>/i.test(htmlStringHere)
This searches for empty tags (some predefined) and /-terminated XHTML empty tags and validates the string as HTML because of the empty tag, OR it captures the tag name and attempts to find its closing tag somewhere in the string to validate it as HTML.
Explained demo: http://regex101.com/r/cX0eP2
Update:
Complete validation with:
/<(br|basefont|hr|input|source|frame|param|area|meta|!--|col|link|option|base|img|wbr|!DOCTYPE).*?>|<(a|abbr|acronym|address|applet|article|aside|audio|b|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frameset|head|header|hgroup|h1|h2|h3|h4|h5|h6|html|i|iframe|ins|kbd|keygen|label|legend|li|map|mark|menu|meter|nav|noframes|noscript|object|ol|optgroup|output|p|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video).*?<\/\2>/i.test(htmlStringHere)
This does proper validation as it contains ALL HTML tags: empty ones first, followed by the rest, which need a closing tag.
Explained demo here: http://regex101.com/r/pE1mT5
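A pattern that long is hard to maintain by hand, so one option is to build it from tag lists. The following is only a sketch under my own assumptions: the tag arrays are deliberately abbreviated, buildHtmlRegex is a hypothetical helper name, and I added a \b after the tag name (not present in the regex above) so that <bold> is not mistaken for <b>.

```javascript
// Abbreviated, illustrative tag lists; the full regex above enumerates many more.
const VOID_TAGS = ['br', 'hr', 'img', 'input', 'meta', 'link', 'wbr'];
const PAIRED_TAGS = ['a', 'b', 'div', 'em', 'p', 'span', 'strong', 'ul', 'li'];

// Hypothetical helper: void tags match on their own, paired tags need a matching close.
function buildHtmlRegex(voidTags, pairedTags) {
  const voidPart = `<(?:${voidTags.join('|')})\\b.*?>`;
  // \\1 refers to the paired tag name, the only capturing group in the pattern.
  const pairedPart = `<(${pairedTags.join('|')})\\b.*?</\\1>`;
  return new RegExp(`${voidPart}|${pairedPart}`, 'i');
}

const htmlRegex = buildHtmlRegex(VOID_TAGS, PAIRED_TAGS);

console.log(htmlRegex.test('<br>'));           // true
console.log(htmlRegex.test('<p>hi</p>'));      // true
console.log(htmlRegex.test('<p>hi</span>'));   // false (closing tag does not match)
console.log(htmlRegex.test('<bold>x</b>'));    // false (thanks to the added \b)
console.log(htmlRegex.test('2 < 5 && 5 > 3')); // false
```

Like the original, this uses `.`, which does not match newlines, so markup spanning several lines would need `[\s\S]` instead.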
Upvotes: 18
Reputation: 150010
If you're creating a regex from a string literal you need to escape any backslashes:
var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");
// extra backslash added here ---------------------^ and here -----^
This is not necessary if you use a regex literal, but then you need to escape forward slashes:
var htmlRegex = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;
// forward slash escaped here ------------------------^
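To see why the unescaped version fails, here is a small sketch of what the string literal actually contains before the RegExp constructor ever sees it. The variable names are only for illustration.

```javascript
// Inside a string literal, "\b" is the backspace character (char code 8),
// so the regex engine never receives a word-boundary escape.
const unescaped = "x\b";
const escaped = "x\\b";

console.log(unescaped.length, unescaped.charCodeAt(1)); // 2 8
console.log(escaped.length);                            // 3

// With the backslashes doubled, the string form and the literal form behave the same.
const fromString = new RegExp("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");
const fromLiteral = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;

console.log(fromString.test("<p>hello</p>"));  // true
console.log(fromLiteral.test("<p>hello</p>")); // true
console.log(fromString.test("<p>hello</b>"));  // false (closing tag name must match)
```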
Also, your jsfiddle didn't work because you assigned an onload handler inside another onload handler: the default, as set in the Frameworks & Extensions panel on the left, is to wrap the JS in an onload. Change that to a nowrap option and fix the string literal escaping and it "works" (within the constraints everybody has pointed out in comments): http://jsfiddle.net/wFWtc/4/
Note that JavaScript regular expressions do support back-references, so this part of your expression:
</\1>
works in JS, provided the backslash survives the string literal (write it as \\1 there).
Upvotes: 5