Reputation: 77
I want to scrape a list of elements (player name, cost, buyer, seller, day) from a local HTML file, but I have a problem with the buyer and seller: the selectors below differ only in an index (2 for the 1st transfer, 3 for the 2nd). The names I want are 'computer' and 'peter' for the 1st transfer and 'computer' and 'james' for the 2nd.
document.querySelector("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(2)")
document.querySelector("#pressReleases > ul > li:nth-child(3) > ul > li.text > div > strong:nth-child(2)")
How can I scrape these li elements when that index changes from one transfer to the next?
I've tried this in R:
library(rvest)  # provides html_nodes(), html_text() and the %>% pipe
# dades holds the parsed local HTML file
dades <- mylocalfile
player <- dades %>% html_nodes("ul.player li.text strong") %>% html_text() %>% trimws()
cost <- dades %>% html_nodes("ul.player li.text span") %>% html_text() %>% trimws()
buyer <- dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(2)") %>% html_text() %>% trimws()
seller <- dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(1)") %>% html_text() %>% trimws()
day <- dades %>% html_nodes("ul.player li.text time") %>% html_text() %>% trimws()
I noticed that the 2 in #pressReleases > ul > li:nth-child(2) changes for each li class="post pressRelease", so a fixed index only matches one transfer.
The HTML code:
<div class="newsList" id="pressReleases">
  <ul>
    <li class="date" style="background-color: rgb(128, 128, 128);">
      <strong>Fitxatges del dia</strong>
      09/08/2019
    </li>
    <li class="post pressRelease">
      <ul class="player">
        <li class="photo">
          <img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, 'L', 'espanyol.png')">
          <img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol(1).png" alt="Espanyol" class="crest">
        </li>
        <li class="text">
          <strong>Player1</strong>
          <time>09/08/2019 - 05:30</time>
          <span>16.245.485 €</span>
          <div class="from">
            D'
            <strong>computer</strong>
            a
            <strong>peter</strong>
          </div>
        </li>
        <a class="icon-revert">
        </a>
      </ul>
      <div class="bid second">
        <span class="triangle"></span>
        <strong class="second">2º puja</strong>
        <strong>matheu:</strong>
        <span class="price">15.925.828 €</span>
      </div>
    </li>
    <li class="post pressRelease">
      <ul class="player">
        <li class="photo">
          <img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, 'L', 'real-sociedad.png')">
          <img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad(1).png" alt="Real Sociedad" class="crest">
        </li>
        <li class="text">
          <strong>Player2</strong>
          <time>09/08/2019 - 05:30</time>
          <span>1.111.711 €</span>
          <div class="from">
            D'
            <strong>computer</strong>
            a
            <strong>james</strong>
          </div>
        </li>
        <a class="icon-revert">
        </a>
      </ul>
    </li>
  </ul>
</div>
Upvotes: 2
Views: 696
Reputation: 84465
Have you tried the following selectors? For sellers (the first name in each .from block, i.e. who the player comes from):
#pressReleases .from strong:nth-child(1)
and for buyers (the second name):
#pressReleases .from strong:nth-child(2)
Neither selector depends on the position of the transfer in the list. Assuming you have read the HTML into a variable page, then (extend to include your other variables):
library(rvest)
sellers <- page %>% html_nodes("#pressReleases .from strong:nth-child(1)") %>% html_text()
buyers <- page %>% html_nodes("#pressReleases .from strong:nth-child(2)") %>% html_text()
df <- data.frame(buyers, sellers)
The data frame should then be easy to export.
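To extend this to the other variables, one option is to select each transfer first and then look inside it, so that every column has exactly one value per transfer. A minimal sketch, assuming rvest is installed; the inline HTML is a trimmed copy of the question's markup, and the names items, first_text, transfers and out_file are made up for illustration:

```r
library(rvest)

# Trimmed copy of the question's markup, parsed from a string for illustration.
page <- read_html('
<div class="newsList" id="pressReleases"><ul>
  <li class="date"><strong>Fitxatges del dia</strong> 09/08/2019</li>
  <li class="post pressRelease"><ul class="player"><li class="text">
    <strong>Player1</strong><time>09/08/2019 - 05:30</time><span>16.245.485 €</span>
    <div class="from">D\' <strong>computer</strong> a <strong>peter</strong></div>
  </li></ul></li>
  <li class="post pressRelease"><ul class="player"><li class="text">
    <strong>Player2</strong><time>09/08/2019 - 05:30</time><span>1.111.711 €</span>
    <div class="from">D\' <strong>computer</strong> a <strong>james</strong></div>
  </li></ul></li>
</ul></div>')

# One node per transfer; html_node() then yields one match (or NA) per transfer.
items <- html_nodes(page, "#pressReleases li.pressRelease")

# Hypothetical helper: trimmed text of the first match inside each transfer.
first_text <- function(css) html_text(html_node(items, css), trim = TRUE)

transfers <- data.frame(
  player = first_text("li.text > strong"),  # direct child only, skips the .from names
  cost   = first_text("li.text > span"),
  seller = first_text(".from strong:nth-child(1)"),
  buyer  = first_text(".from strong:nth-child(2)"),
  day    = first_text("li.text > time"),
  stringsAsFactors = FALSE
)

# Export; the file name and location are only an example.
out_file <- file.path(tempdir(), "transfers.csv")
write.csv(transfers, out_file, row.names = FALSE)
```

Working per transfer keeps the columns aligned even if one transfer lacks an element, because the missing value becomes NA instead of shifting the rest of the column.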
Upvotes: 0
Reputation: 5281
Here is a possible solution to get the buyer and seller:
# Read the local file
URL <- 'D:/Test/Test.html'
wp <- xml2::read_html(URL, encoding = 'utf-8')
# Extract the relevant nodes
node <- rvest::html_nodes(wp, '.from')
# Extract the names (the pattern expects Windows-style \r\n line breaks in the text)
seller <- gsub('.*D\'\r\n\\s+(.*?)\r\n\\s+a\\s?\r\n\\s+(.*?)\r\n.*', '\\1', rvest::html_text(node))
# [1] "computer" "computer"
buyer <- gsub('.*D\'\r\n\\s+(.*?)\r\n\\s+a\\s?\r\n\\s+(.*?)\r\n.*', '\\2', rvest::html_text(node))
# [1] "peter" "james"
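If the file does not use \r\n line breaks, the pattern above will not match. A sketch of a variant that collapses all whitespace first, so that it does not depend on the line endings. It assumes every .from block reads D' <seller> a <buyer>, and the txt vector is a hand-written stand-in for rvest::html_text(node):

```r
# Stand-in for rvest::html_text(node): one string per .from block,
# written once with \r\n and once with \n line breaks.
txt <- c("\r\n  D'\r\n  computer\r\n  a\r\n  peter\r\n",
         "\n D'\n computer\n a\n james\n")

# Collapse every run of whitespace to a single space and trim the ends.
flat <- trimws(gsub("\\s+", " ", txt))

# Capture the two names on either side of the " a " separator.
pattern <- "^D' (.+?) a (.+)$"
seller <- sub(pattern, "\\1", flat, perl = TRUE)
buyer  <- sub(pattern, "\\2", flat, perl = TRUE)
```

A name that itself contains " a " would be split at its first occurrence, so selecting the strong elements directly is safer when that can happen.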
Upvotes: 3