Reputation: 45
A page loads 35,000 elements, of which only the first 10 are of interest to me. Returning all elements makes the scraping extremely slow. I have only succeeded in either returning the first element with:
driver.find_element_by
Or returning all 35,000 elements with:
driver.find_elements_by
Does anyone know a way to return only the first x elements found?
Upvotes: 3
Views: 1266
Reputation: 151441
Here is a significantly different approach, posted as a separate answer because some people will prefer this one and others will prefer the other one I gave.
This one relies on using XPath to slice the results:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.example.com")
# We add 35000 paragraphs with class `test` to the page so that we can
# later show how to get the first 10 paragraphs of this class. Each
# paragraph is uniquely numbered. These paragraphs are put into
# individual `div` to make sure they are not siblings of one
# another. (This prevents offering a naive XPath expression that would
# work only if they *are* siblings.)
driver.execute_script("""
var html = [];
for (var i = 0; i < 35000; ++i) {
    html.push("<div><p class='test'>" + i + "</p></div>");
}
document.body.innerHTML += html.join("");
""")
elements = driver.find_elements_by_xpath(
    "(//p[@class='test'])[position() < 11]")
for element in elements:
    print(element.text)
driver.quit()
Note that XPath uses 1-based indexes, so < 11 is indeed the proper expression. The parentheses around the first part of the expression are absolutely necessary. With these parentheses, the [position() < 11] test checks the position each node has in the node-set that results from the expression in parentheses. Without them, the position test would check the position of each node relative to its parent node, which would match all nodes because every <p> is at the first position in its respective <div>. (This is why I've added those <div> elements above: to show this problem.)
I would use this solution if I were already using XPath for my selection. Otherwise, if I were doing a search by CSS selector or by id I would not convert it to XPath only to perform the slice. I'd use the other method I've shown.
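To make the role of the parentheses concrete, here is a minimal sketch of a helper that wraps an arbitrary XPath expression so that it selects only its first few matches. The name first_n_xpath is hypothetical and not part of Selenium; it only builds the string you would then pass to the XPath lookup shown above. It uses position() <= count, which is equivalent to the < 11 form when count is 10.

```python
def first_n_xpath(xpath, count):
    """Hypothetical helper: wrap `xpath` so it selects only its first `count` matches."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # The parentheses make position() refer to the whole result node-set
    # rather than to each node's position relative to its own parent.
    return "({0})[position() <= {1}]".format(xpath, count)


print(first_n_xpath("//p[@class='test']", 10))
# prints: (//p[@class='test'])[position() <= 10]
```

Because the original expression is always parenthesized, the same helper should also work when the expression you pass in already contains predicates of its own.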
Upvotes: 1
Reputation: 151441
Selenium does not provide a facility that allows returning only a slice of the results of the .find_elements... calls. A general solution, if you want to optimize things so that Selenium does not have to return every single element, is to perform the slice operation on the browser side, in JavaScript. I present this solution in this answer. If you want to use XPath for selecting the DOM nodes, you could adapt this answer to that, or you could use the method in another answer I've submitted.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.example.com")
# We add 35000 paragraphs with class `test` to the page so that we can
# later show how to get the first 10 paragraphs of this class. Each
# paragraph is uniquely numbered.
driver.execute_script("""
var html = [];
for (var i = 0; i < 35000; ++i) {
    html.push("<p class='test'>" + i + "</p>");
}
document.body.innerHTML += html.join("");
""")
elements = driver.execute_script("""
return Array.prototype.slice.call(document.querySelectorAll("p.test"), 0, 10);
""")
# Verify that we got the first 10 elements by outputting the text they
# contain to the console. The loop here is for illustration purposes
# to show that the `elements` array contains what we want. In real
# code, if I wanted to process the text of the first 10 elements, I'd
# do what I show next.
for element in elements:
    print(element.text)
# A better way to get the text of the first 10 elements. This results
# in 1 round-trip between this script and the browser. The loop above
# would take 10 round-trips.
print(driver.execute_script("""
return Array.prototype.slice.call(document.querySelectorAll("p.test"), 0, 10)
    .map(function (x) { return x.textContent; });
"""))
driver.quit()
The Array.prototype.slice.call rigmarole is needed because what document.querySelectorAll returns looks like an Array but is not actually an Array object. (It is a NodeList.) So it does not have a .slice method, but you can pass it to Array's slice method.
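If you need this in more than one place, the browser-side slice can be wrapped in a small function. The sketch below is only an illustration: find_first_n_by_css and _SLICE_SCRIPT are hypothetical names, not part of Selenium. It relies on the fact that extra arguments given to execute_script are exposed to the script as arguments[0], arguments[1], and so on, so the selector and the count do not have to be pasted into the JavaScript source.

```python
# JavaScript run in the browser: select with a CSS selector, then slice the
# NodeList so that only the first `count` nodes are sent back to Python.
_SLICE_SCRIPT = """
return Array.prototype.slice.call(
    document.querySelectorAll(arguments[0]), 0, arguments[1]);
"""


def find_first_n_by_css(driver, css_selector, count):
    """Hypothetical helper: return at most `count` elements matching `css_selector`.

    `driver` is any object exposing Selenium's `execute_script(script, *args)`.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # The selector and count travel as script arguments, not as string
    # concatenation, so no quoting or escaping of the selector is needed.
    return driver.execute_script(_SLICE_SCRIPT, css_selector, count)
```

With the driver from the example above, find_first_n_by_css(driver, "p.test", 10) should return the same ten elements as the inline script.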
Upvotes: 2