Reputation: 8042

How to copy particular elements from a web page

My goal is to get a particular text area from a web page. Imagine being able to draw a rectangle anywhere on a page and having everything inside it copied to your clipboard. For this I am using FireBug with its console window and XPath (feel free to suggest other solutions; I have searched for plugins and bookmarklets but did not find anything useful). The values I want to obtain have the following format (as seen in FireBug's "HTML inspect"):

<span class="number3_0" title="Numbers">3.00</span>

so I ended up with the following code, which I run from the FireBug console: $x("//span[@title='Numbers']/text()")

After this I get something like this:

[<TextNode textContent="2.00">, <TextNode textContent="2.00">, <TextNode textContent="2.00">, <TextNode textContent="2.00">, <TextNode textContent="3.00">]

Then I right-click the [ and select "Inspect in DOM Panel", press Ctrl+A, and copy/paste the data, which has the following format:

0   <TextNode textContent="2.00">
1   <TextNode textContent="2.00">
2   <TextNode textContent="2.00">
3   <TextNode textContent="2.00">
4   <TextNode textContent="3.00">

As you can guess, the value of textContent is the information I am interested in. I have tried to modify the original XPath query so that it returns only these numbers, but with no luck. I tried:

wrapping the whole query in string(), as suggested in "Xpath - get only node content without other elements"

figuring out how the approach in "Extracting text in between nodes through XPath" works, and a lot more.

To obtain the desired values I used some bash scripting and XML formatting. After this tedious, error-prone task I get the following format:

<?xml version="1.0"?>
<head>
  <TextNode textContent="2.00"/>
  <TextNode textContent="2.00"/>
  <TextNode textContent="2.00"/>
  <TextNode textContent="2.00"/>
  <TextNode textContent="3.00"/>
  <TextNode textContent="3.00"/>
</head>

Now I use xmlstarlet to obtain those values in the following way (yes, I know I could use a regexp in the previous step and have all the data I need, but I am interested in DOM/XPath parsing and am trying to figure out how it works):

cat input | xmlstarlet sel -t -m "//TextNode" -v 'concat(@textContent," 
")'

This finally gives me the desired output:

2.00
2.00
2.00
2.00
3.00

My questions are a bit generic:

1.) How can this terribly long process be automated?
2.) How can I modify the original XPath string used in FireBug, $x("//span[@title='Numbers']/text()"), to get only the numbers immediately and save myself the rest of the steps?
3.) I am still not very familiar with xmlstarlet; the selection (sel) mode in particular drives me crazy. I have seen various combinations of the following options:

-c or --copy-of - print copy of XPATH expression

-v or --value-of - print value of XPATH expression

-o or --output - output string literal

-m or --match - match XPATH expression

Can somebody please explain when to use which one? I would be glad to see concrete examples if possible. In case it helps, here are some pages using combinations of these options that I do not understand well: http://www.grahl.ch/blog/minutiae-return-content-element-xmlstarlet, "Extracting and dumping elements using xmlstarlet", and "Testing for an XML attribute".

4.) The last question about xmlstarlet is cosmetic: how do I get nicely newline-separated output? As you can see, I 'cheat' by adding a literal newline as the separator, but when I tried an escape sequence like this:

cat input | xmlstarlet sel -t -m "//TextNode" -v 'concat(@textContent,"\n")'

it did not work. The original reference from which I learned a lot also does it in this 'ugly' way: http://www.ibm.com/developerworks/library/x-starlet/index.html

PS: maybe all of these steps could be simplified with curl + xmlstarlet, but it would be handy to also have the FireBug option for pages that require a login or similar.

Thanks for any ideas.

Upvotes: 0

Answers (2)

fflorent

Reputation: 1646

$$("<CSS3 selector>") and $x("<XPATH>") in Firebug return a real Array (unlike document.querySelectorAll() or document.evaluate()), so array methods such as map() can be called on the result directly.

With Firefox + Firebug:

var numbersNode = $x("//span[@title='Numbers']/text()");
var numbersText = numbersNode.map(function(numberNode) {
    return numberNode.textContent;
}).join("\n");
// Special command of Firebug to copy text into clipboard:
copy(numbersText);

You can make this even more compact with ECMAScript 6 arrow functions:

copy($x("//span[@title='Numbers']/text()").map(x => x.textContent).join("\n"));

The same works if you choose $$('span[title="Numbers"]'), as William Narmontas suggested.
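
For consoles where Firebug's $x() is not available, roughly the same result can be had from the standard document.evaluate() call. The following is only a minimal sketch: xpathTexts is a hypothetical helper name, not part of Firebug or the DOM, and the document is passed in as a parameter so the function can also be exercised with a stand-in object.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM XPath API.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical helper: evaluate an XPath expression against `doc` and
// return the textContent of every matching node as an array of strings.
function xpathTexts(doc, expression) {
  const snapshot = doc.evaluate(
    expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const texts = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    texts.push(snapshot.snapshotItem(i).textContent);
  }
  return texts;
}

// In a browser console (assumes a page containing the spans from the question):
// xpathTexts(document, "//span[@title='Numbers']/text()").join("\n");
```

Unlike $x(), document.evaluate() returns an XPathResult rather than an Array, which is why the sketch copies the snapshot items into an array by hand.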

Florent

Upvotes: 1

ScalaWilliam

Reputation: 741

From what I gather you want to collect numbers from spans that have a title 'Numbers' and want it as a string.

Try the following:

var numberNodes = document.querySelectorAll('span[title="Numbers"]')
function giveText(me) { return me.textContent; }
Array.prototype.map.call(numberNodes, giveText).join("\n");

The first line selects all nodes using CSS query selectors in the document (meaning you do not need XPath). The second line creates a function that returns the text content of a node. The third line maps the elements from the numberNodes list using the giveText function, produces an array of numbers, and then finally joins them with a newline.
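
To see why Array.prototype.map.call is used rather than calling map() on the result directly, here is a small sketch that runs outside the browser. The joinTextContent helper and the fakeNodeList object are illustrative assumptions, standing in for the array-like NodeList that document.querySelectorAll() returns.

```javascript
// Hypothetical helper: works on any array-like collection of nodes,
// such as the NodeList returned by document.querySelectorAll().
function joinTextContent(nodes, separator) {
  return Array.prototype.map.call(nodes, function (node) {
    return node.textContent;
  }).join(separator);
}

// Stand-in for a NodeList: it has indexed items and a length, but no map().
const fakeNodeList = {
  length: 2,
  0: { textContent: "2.00" },
  1: { textContent: "3.00" }
};

const joined = joinTextContent(fakeNodeList, "\n");
console.log(joined);
```

In the browser the call would be joinTextContent(document.querySelectorAll('span[title="Numbers"]'), "\n").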

With this, you might not need xmlstarlet at all.

Upvotes: 2
