mnort9

Reputation: 1820

Extract a string from HTML with NodeJS

Here is the html...

<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>

I'm using NodeJS. I'm trying to extract the trackID, in this case 11111111 following tracks%2F. What is the most stable method for performing this?

Should I use regex or some JS string method such as substring() or match()?

Upvotes: 1

Views: 3739

Answers (6)

Jon Musselwhite

Reputation: 1821

Update for 2019...

This builds on blueiur's answer and walks through a solution in more detail. JSDOM needs to be installed before you can use it:

npm install jsdom

Now, according to the documentation, you can import the JSDOM constructor like this:

const jsdom = require('jsdom');
const { JSDOM } = jsdom;

You already have some HTML you want to parse, so I'll use your example and define it as a template literal:

const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>`;

Here's the fun part... parse the html in NodeJS:

const { document } = (new JSDOM(data)).window;

What's happening here? You're creating a new JSDOM object from the provided HTML and grabbing the document property of its window property. From this point on, you can use document.getElementsByTagName() and other similar functions just like you would in a browser.

To continue with your specific example, you want to extract the src attribute of the only iframe in the document. There are multiple ways to do that. One example is to use getElementsByTagName to pull the first iframe like this:

const src1 = document.getElementsByTagName('iframe')[0].src;

Now that we have the src attribute, we can split it apart and process the url query value. This is where we use the URL class, which comes with NodeJS. According to the documentation, we can get the search parameters by creating a URL object and accessing its searchParams property like this:

const params = (new URL(src1)).searchParams;

Now you've got the query string as a URLSearchParams object and you can access individual terms like this:

const scURL = params.get('url');

If you look at the contents of scURL now, you'll find it is the embedded URL that was passed as the url query parameter, so we can parse that with another URL object and extract its pathname property like this:

const src2 = (new URL(scURL)).pathname;

We're getting close now, and can split the path apart to get the value you wanted using JavaScript's standard string functions:

const val = src2.split('/')[2];

And print the result:

console.log(val);

... which produces this output:

11111111

To summarize, here is the complete code:

const jsdom = require('jsdom');
const { JSDOM } = jsdom;

const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>`;

const { document } = (new JSDOM(data)).window;

const src1 = document.getElementsByTagName('iframe')[0].src;

const params = (new URL(src1)).searchParams;

const scURL = params.get('url');

const src2 = (new URL(scURL)).pathname;

const val = src2.split('/')[2];

console.log(val);

Feel free to consolidate that and eliminate intermediate values as desired.
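
If you already have the src value in hand (for instance, from the jsdom step above), the URL-parsing half can be wrapped in a small helper that needs only Node's built-in URL class. This is a minimal sketch, not a definitive implementation: the name trackIdFromSrc is made up for illustration, and it assumes the embedded url parameter is a valid absolute URL whose last path segment is the track ID.

```javascript
// Hypothetical helper (name is illustrative): extract the track ID from a
// SoundCloud player src URL using only Node's built-in URL class.
// Assumes the src has a `url` query parameter whose path ends in the ID.
function trackIdFromSrc(src) {
  // First URL: the player address. searchParams.get() also percent-decodes
  // the value, so %2F comes back as "/".
  const embedded = new URL(src).searchParams.get('url');
  if (embedded === null) {
    return null; // no `url` parameter present
  }
  // Second URL: the API address, e.g. http://api.soundcloud.com/tracks/11111111
  // Note: new URL() throws a TypeError if `embedded` is not a valid URL.
  const segments = new URL(embedded).pathname.split('/').filter(Boolean);
  return segments.length > 0 ? segments[segments.length - 1] : null;
}

const src = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false';
console.log(trackIdFromSrc(src));
```

Taking the last non-empty path segment instead of a fixed index is a design choice here; it keeps working if the path gains a prefix, but it would need adjusting if something ever followed the ID.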

Upvotes: 2

Michael Lorton

Reputation: 44436

The Right™ way to do this is to parse the HTML using some XML parser, get the URL that way, and then use a reg-exp to parse the URL.

If for some reason you don't have an infinite amount of time and energy, one of the proposed purely reg-exp solutions would work.
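
As a rough sketch of the second half of that approach, suppose a parser has already handed you the iframe's src value as a string. The helper name trackIdByRegex and the pattern are assumptions for illustration, not something from the question:

```javascript
// Hypothetical helper (name is illustrative): apply a reg-exp to a src value
// that an HTML/XML parser has already extracted from the iframe.
function trackIdByRegex(src) {
  // decodeURIComponent turns %2F back into "/", so the pattern can look for
  // "/tracks/<digits>". It throws a URIError on a malformed "%" sequence.
  const decoded = decodeURIComponent(src);
  const match = /\/tracks\/(\d+)/.exec(decoded);
  // exec() returns null when nothing matches; group 1 holds the digits.
  return match ? match[1] : null;
}

const playerSrc = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false';
console.log(trackIdByRegex(playerSrc));
```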

Upvotes: 0

reagan

Reputation: 653

If you know tracks%2F is only going to show up once, you could do:

var your_track_ID = src.split(/tracks%2F/)[1].split(/&amp/)[0];

There are probably better ways, but that should work fine for your purposes.
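
One thing to watch for: if tracks%2F is missing, the [1] element is undefined and the second split throws. A guarded variant might look like the sketch below. The function name trackIdBySplit is hypothetical, and it assumes the ID runs up to the next & (which also covers &amp;) or to the end of the string.

```javascript
// Hypothetical helper (name is illustrative): a guarded version of the
// split idea that returns null instead of throwing when the marker is absent.
function trackIdBySplit(text) {
  const marker = 'tracks%2F';
  const start = text.indexOf(marker);
  if (start === -1) {
    return null; // marker not found
  }
  // Everything after the marker; the ID is at the front of this string.
  const rest = text.slice(start + marker.length);
  // Stop at the next "&" (the start of "&amp;" in raw HTML), if there is one.
  const end = rest.indexOf('&');
  return end === -1 ? rest : rest.slice(0, end);
}

const snippet = 'src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false"';
console.log(trackIdBySplit(snippet));
```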

Upvotes: 2

blueiur

Reputation: 1507

You can find the track ID with the Node modules url, jsdom, and qs.

Try this:

var jsdom = require('jsdom');
var url = require('url');
var qs = require('qs');

var str = '<iframe width="100%" height="166" scrolling="no" frameborder="no" '
  + 'src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false'
  + '&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false'
  + '&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>';

jsdom.env({
  html: str,
  scripts: [
    'http://code.jquery.com/jquery-1.5.min.js'
  ],
  done: function(errors, window) {
    var $ = window.$;
    var src = $('iframe').attr('src');
    var aRes = qs.parse(decodeURIComponent(url.parse(src).query)).url.split('/');
    var track_id = aRes[aRes.length-1];

    console.log("track_id =", track_id);
  }
});

The result is:

track_id = 11111111

Upvotes: 1

Ricardo Tomasi

Reputation: 35263

It's generally a terribly bad idea to parse HTML with a regular expression, but this might be forgivable. I'd look for the complete URL for safety:

var pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/
  , trackID = (html.match(pattern) || [])[1]

Upvotes: 1

kenbritton

Reputation: 46

If the track ID is always 8 digits and the HTML doesn't change, you can do this:

// match() returns an array (or null), so take element 0 to get the ID string
var trackId = (html.match(/\d{8}/) || [])[0];

Upvotes: 0
