SimonDau

Reputation: 547

Parsing og: tags with ColdFusion regex

If one wants to extract/match Open Graph (og:) tags from HTML using regex (and ColdFusion 9+), how would one go about doing it?

The tricky bit is that it has to cover both possible attribute orders, as in the following examples:

<meta property="og:type" content="website" /> 
<meta content="website" property="og:type"/> 

So far all I got is this:

<cfset tags = ReMatch('(og:)(.*?)>',html_content)>

It does match both of the tags; however, only the first form has the content bit returned with it (in the second form, content comes before og:, so the match starts too late to include it). And the content is something that I require.

Just to make it absolutely clear, the desired output should be an array with all of the OG tags (they could be type, image, author, description, etc.). That means it should be flexible and not based on the og:type example alone.

Of course, if it's possible, the ideal output would be a struct with the first column being the name of the tag and the second containing the value (content). But that can be achieved with post-processing and is not as important as extracting the tags themselves.

Cheers, Simon

Upvotes: 1

Views: 422

Answers (2)

SimonDau

Reputation: 547

Ok, so after the suggestion from @abbottmw (thank you very much!), here's the solution:

Download the jsoup jar file from here: http://jsoup.org/download

Then load it with JavaLoader like this:

<cfhttp url="...." result="oghtml"> <!--- fetch the HTML content --->
<cfscript>
    // Path to the jar, or wherever you decide to place the file
    paths = [expandPath("/lib/jsoup.jar")];
    loaderObj = createObject("component", "javaloader.JavaLoader").init(paths);
    jsoup = loaderObj.create("org.jsoup.Jsoup");
    // cfhttp returns a struct; the response body is in fileContent
    doc = jsoup.parse(oghtml.fileContent);
    // *= matches "og:" anywhere in the property attribute value
    tags = doc.select("meta[property*=og:]");
</cfscript>
<cfloop index="e" array="#tags#">
    <cfoutput>
        #e.attr("property")# | #e.attr("content")#<br />
    </cfoutput>
</cfloop>

And that is it. The complete list of og tags is in the [tags] array.

Of course it's not the regex solution that was originally requested, but hey, it works!
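For the struct output mentioned in the question, the [tags] collection can be post-processed roughly as below. This is only a sketch: it assumes the `tags` variable from the snippet above is in scope, and `ogTags` is a name made up for illustration.

```cfml
<cfscript>
    // Assumption: `tags` is the jsoup Elements collection selected above.
    // `ogTags` is a hypothetical name for the resulting struct.
    ogTags = structNew();
    it = tags.iterator();
    while (it.hasNext()) {
        e = it.next();
        // Key is the og: property name, value is its content attribute.
        ogTags[e.attr("property")] = e.attr("content");
    }
</cfscript>
<cfdump var="#ogTags#">
```

If a page repeats a property (several og:image tags, say), the last one overwrites the earlier ones here; storing an array per key would avoid that.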

Upvotes: 1

abbottmw

Reputation: 754

So you want an array like ['og:author','og:type', 'og:image'...]?

Try using a regex like og:([\w]+)

That should give you a start. You will have duplicates if you have two of the same og:foo meta tags.
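If you also need the content value regardless of attribute order, one option is to match each whole meta tag first and then read the two attributes separately. A rough sketch, not a robust HTML parser: it assumes double-quoted attribute values with no > inside them, and `html_content`, `metaTags`, `ogTags`, `prop` and `cont` are illustrative names with made-up sample data.

```cfml
<cfscript>
    // Sample input covering both attribute orders from the question.
    html_content = '<meta property="og:type" content="website" />'
                 & '<meta content="Example" property="og:title"/>';

    // Step 1: grab every <meta ...> tag that mentions og: somewhere inside it.
    metaTags = reMatchNoCase('<meta\b[^>]*og:[^>]*>', html_content);

    // Step 2: read each attribute on its own, so their order does not matter.
    ogTags = structNew();
    for (i = 1; i <= arrayLen(metaTags); i++) {
        prop = reReplaceNoCase(metaTags[i], '.*\bproperty\s*=\s*"(og:[^"]*)".*', '\1');
        cont = reReplaceNoCase(metaTags[i], '.*\bcontent\s*=\s*"([^"]*)".*', '\1');
        // reReplace returns the input unchanged when nothing matched; skip those tags.
        if (prop neq metaTags[i] and cont neq metaTags[i]) {
            ogTags[prop] = cont;
        }
    }
</cfscript>
<cfdump var="#ogTags#">
```

With the sample input, `ogTags` ends up with the keys og:type and og:title. As with the plain og: match, repeated properties are a problem: here the last value wins.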

You can also look at jsoup to parse the HTML for you. It makes this a lot easier.

There are a few good blog posts on using it in CFML:

jQuery-like parsing in Java

Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup

Upvotes: 2
