Reputation: 47
I am just starting with web scraping in R. I wrote this code:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
mps %>%
  html_nodes("tr") %>%   # select every table row on the page
  html_text()            # extract the text of each row
It gets the content I need, which I then save to a text file. My problem is that I want to remove these red points from the output, but I can't. Could you please help me?
I think these points are replacing the <b> and <br> tags in the HTML code.
Upvotes: 3
Views: 397
Reputation: 1345
You can always use regular expressions to remove undesired characters from the extracted text (assuming mps holds the character vector returned by html_text()), e.g.,
mps <- gsub("•", " ", mps)
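As a slightly fuller sketch of the same idea, here it is applied to a small made-up character vector rather than the live page. The names raw_text, no_bullets and clean_text are hypothetical, and the sample strings are invented for illustration. The bullet is written as the escape "\u2022" so the script does not depend on file encoding, and fixed = TRUE treats it as a literal character rather than a regex pattern.

```r
# Hypothetical sample standing in for the output of html_text()
raw_text <- c("\u2022 Kelibia \u2022 Terrain", "Le Bardo\u2022Location")

# Replace every bullet with a space (literal match, no regex needed)
no_bullets <- gsub("\u2022", " ", raw_text, fixed = TRUE)

# Collapse the runs of whitespace left behind and trim both ends
clean_text <- trimws(gsub("\\s+", " ", no_bullets))

print(clean_text)
```

Replacing with a space instead of an empty string keeps neighbouring words from running together when the bullet was the only separator.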
Upvotes: 0
Reputation: 43364
Whoever built that page very frustratingly nested the listing rows inside another table without wrapping them in a <table> tag of their own, so it's easiest to rebuild the table so that it parses cleanly:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
df <- mps %>%
  html_nodes("tr.Entete1, tr.Tableau1") %>%   # get correct rows
  paste(collapse = '\n') %>%                  # paste nodes back to a single string
  paste('<table>', ., '</table>') %>%         # add enclosing table node
  read_html() %>%                             # reread as HTML
  html_node('table') %>%
  html_table(fill = TRUE) %>%                 # parse as table
  { setNames(.[-1,], make.names(.[1,], unique = TRUE)) }   # grab names from first row
head(df)
#> X Région NA. Nature NA..1 Type NA..2
#> 2 Prix <NA> NA <NA> NA <NA> NA
#> 3 Modifiée NA <NA> NA <NA> NA
#> 4 Kelibia NA Terrain NA Terrain nu NA
#> 5 Cite El Ghazala NA Location NA App. 4 pièc NA
#> 6 Le Bardo NA Location NA App. 1 pièc NA
#> 7 Le Bardo NA Location vacance NA App. 3 pièc NA
#> Texte.annonce NA..3 Prix Prix.1 X.1 Modifiée
#> 2 <NA> NA <NA> <NA> <NA> <NA>
#> 3 <NA> NA <NA> <NA> <NA> <NA>
#> 4 Terrain a 5 km de kelibi NA 80 000 07/05/2017
#> 5 S plus 3 haut standing c NA 790 07/05/2017
#> 6 Appartements meubles NA 40 000 07/05/2017
#> 7 Un bel appartement au bardo m NA 420 07/05/2017
#> Modifiée.1 NA..4 NA..5
#> 2 <NA> NA NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 <NA> NA NA
#> 6 <NA> NA NA
#> 7 <NA> NA NA
Note there are a lot of NAs and other cruft here yet to be cleaned up, but at least it's usable at this point.
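One possible way to start on that cleanup is to drop the columns and rows that contain nothing but NA or empty strings. This is only a sketch on a small made-up data frame, not the scraped result: the names toy, is_blank, keep_cols, keep_rows and cleaned are hypothetical, and the real table will probably need further column-specific fixes.

```r
# Toy data frame imitating the shape of the scraped result:
# one all-NA spacer column and one all-NA leading row
toy <- data.frame(
  Region = c(NA, "Kelibia", "Le Bardo"),
  Spacer = c(NA, NA, NA),
  Nature = c(NA, "Terrain", "Location"),
  stringsAsFactors = FALSE
)

# TRUE where a value is NA or only whitespace
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Keep only the columns that have at least one non-blank value
keep_cols <- vapply(toy, function(col) !all(is_blank(col)), logical(1))
cleaned <- toy[, keep_cols, drop = FALSE]

# Keep only the rows that have at least one non-blank value
keep_rows <- !apply(cleaned, 1, function(row) all(is_blank(row)))
cleaned <- cleaned[keep_rows, , drop = FALSE]

# Renumber the rows after filtering
rownames(cleaned) <- NULL

print(cleaned)
```

Using drop = FALSE keeps the result a data frame even if only one column or row survives the filter.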
Upvotes: 1