Reputation: 547
If one wants to extract/match Open Graph (og:) tags from HTML using regex (and ColdFusion 9+), how would one go about doing it?
The tricky bit is that it has to cover both possible attribute orders, as in the following examples:
<meta property="og:type" content="website" />
<meta content="website" property="og:type"/>
So far all I got is this:
<cfset tags = ReMatch('(og:)(.*?)>',html_content)>
It does match both of the tags; however, only the first form has the content attribute returned with it (in the second form, content comes before og: and so falls outside the match). And the content is something that I require.
Just to make it absolutely clear, the desired output should be an array with all of the OG tags (they could be type, image, author, description, etc.). That means it should be flexible and not based on the og:type example alone.
Of course, if it's possible, the ideal output would be a struct with the tag name as the key and the value (content) as the value. But that can be achieved with post-processing and is not as important as extracting the tags themselves.
Cheers, Simon
Upvotes: 1
Views: 422
Reputation: 547
Ok, so after the suggestion from @abbottmw (thank you very much!), here's the solution:
Download Jsoup jar file from here: http://jsoup.org/download
Then initiate it like this:
<cfhttp url="...." result="oghtml" /> <!--- fetch the HTML content --->
<cfscript>
paths = [expandPath("/lib/jsoup.jar")]; // or wherever you decide to place the file
loaderObj = createObject("component", "javaloader.JavaLoader").init(paths);
jsoup = loaderObj.create("org.jsoup.Jsoup");
doc = jsoup.parse(oghtml.fileContent); // cfhttp returns a struct; the HTML itself is in fileContent
tags = doc.select("meta[property^=og:]"); // meta tags whose property starts with "og:"
</cfscript>
<cfloop index="e" array="#tags#">
<cfoutput>
#e.attr("property")# | #e.attr("content")#<br />
</cfoutput>
</cfloop>
And that is it. The complete list of og tags is in the [tags] array.
Of course, it's not the regex solution that was originally requested, but hey, it works!
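The question also asked for a struct keyed by tag name. A minimal sketch of that post-processing step, assuming `tags` is the jsoup collection returned by `doc.select(...)` in the code above (the name `ogStruct` is made up for illustration):

```cfml
<cfscript>
// Assumption: `tags` is the jsoup Elements collection from doc.select(...) above.
ogStruct = structNew();
it = tags.iterator();
while (it.hasNext()) {
    el = it.next();
    // Bracket notation keeps the key exactly as written, e.g. "og:type".
    ogStruct[el.attr("property")] = el.attr("content");
}
</cfscript>
<cfdump var="#ogStruct#">
```

Note that if a page repeats a property (several og:image tags, for example), later values overwrite earlier ones in this sketch; storing an array per key would avoid that.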
Upvotes: 1
Reputation: 754
So you want an array like ['og:author','og:type', 'og:image'...]?
Try using a regex like og:([\w]+)
That should give you a start. You will have duplicates if you have two of the same og:foo meta tags.
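If the content value is needed as well, one possible regex-only approach is to work in two steps: first match each whole og: meta tag, then pull property and content out of that tag separately, so the attribute order no longer matters. The following is only a sketch under some assumptions: attribute values are double-quoted, contain no ">" character, and the content is non-empty. The sample HTML and the variable names are illustrative.

```cfml
<cfscript>
// Illustrative input covering both attribute orders.
html_content = '<meta property="og:type" content="website" />'
             & '<meta content="http://example.com/a.png" property="og:image"/>';

// Step 1: every <meta ...> tag that has a property starting with "og:".
metaTags = reMatchNoCase('<meta\b[^>]*\sproperty="og:[^"]*"[^>]*>', html_content);

// Step 2: read each attribute on its own, independent of its position in the tag.
ogTags = structNew();
for (i = 1; i <= arrayLen(metaTags); i++) {
    metaTag = metaTags[i];
    p = reFindNoCase('\sproperty="([^"]+)"', metaTag, 1, true);
    c = reFindNoCase('\scontent="([^"]+)"', metaTag, 1, true);
    // pos[1] is 0 when there is no match; tags without content are skipped.
    if (p.pos[1] GT 0 AND c.pos[1] GT 0) {
        ogTags[mid(metaTag, p.pos[2], p.len[2])] = mid(metaTag, c.pos[2], c.len[2]);
    }
}
</cfscript>
<cfdump var="#ogTags#">
```

This stays fragile compared with a real HTML parser: single-quoted or unquoted attributes and unusual markup would need extra patterns.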
You can look at JSoup also to help parse the HTML for you. It makes it a lot easier.
There are a few good blog posts on using it in CFML, for example:
Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup
Upvotes: 2