Justin Benfit

Reputation: 483

Can't access specific html with Floki

I am trying to pull the date and full review text for each review shown at this URL: https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/#link So far I am able to get this:

June 17, 2021SALES VISIT - NEW"Joe was great and took extra time to help make sure I got..."- MelisaswartJoe was great and took extra time to help make sure I got the car that I wanted not pushing me into a car I didn’t want. He even made sure my car  was made ready in his day off. Great Job Thank You. 
Melisa SRead MoreCustomer ServiceQuality of WorkFriendlinessPricingOverall ExperienceRecommend Dealer
                Yes
            Employees Worked With 
                                             Taylor Prickett
                                         5.0
                                             Joe Wynne
                                         5.0
                                             Brandon McCloskey
                                         4.0Report |
        Print Helpful 0
    
    .review-response {
        overflow: hidden;
    }

    .open .review-response {
        max-height: none;
    }

     @media (max-width: 767px) {
         .public-messages {
             border-top: none !important;
             margin-left: 0 !important;
             margin-top: 5px !important;
             padding-top: 0 !important;
         }

         .review-hide {
             display: none !important;
         }

         .open .review-hide{
             display: block !important;
         }
     }

with this code:

def get_reviews_url() do
     case HTTPoison.get("https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/#link") do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        IO.puts body
        |> Floki.find("#reviews")
        |> Enum.map(&Floki.text/1)

However, I want to loop through each review and put the date and the full review text into a map with separate key-value pairs. But when I try to scrape just the date or just the review text, I get nothing in return and can't figure out why. Here is my best attempt at coding it:

def get_reviews_url() do
    case HTTPoison.get("https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/#link") do
     {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
       IO.puts body
       |> Floki.find("div.italic col-xs-6 col-sm-12 pad-none margin-none font-20")#html for dates
       |> Floki.find("h3.no-format inline italic-bolder font-20 dark-grey") #html for review text
       |> Enum.map(&Floki.text/1)

This just returns :ok. I have tried every approach I can think of after reading the documentation and can't get a different result. Any direction would be helpful. Thanks.

Upvotes: 0

Views: 200

Answers (1)

Adam Millerchip

Reputation: 23129

Not really sure how to answer this without just doing it for you, so here goes. You can adjust this however you need.

"https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/#link"
|> HTTPoison.get!()
|> Map.get(:body)
|> Floki.parse()
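# Each .review-entry node is one review.
|> Floki.find(".review-entry")
# Build a map of date => review text from the matched nodes.
|> Map.new(fn entry ->
  # Each pattern match expects exactly one matching node with a single text child.
  [{"div", _, [date]}] = Floki.find(entry, "div.italic")
  [{"p", _, [content]}] = Floki.find(entry, "p.review-content")
  {date, content}
end)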

Output:

%{
  "June 17, 2021" => "Joe was great and took extra time to help make sure I got the car that I wanted not pushing me into a car I didn’t want. He even made sure my car  was made ready in his day off. Great Job Thank You. \r\nMelisa S",
  "June 20, 2021" => "Awesome service, Adrian was great to work with I told him what I wanted and he showed me the best car Thank you so much!",
  ...
}

Key points:

  1. Don't pipe the output of IO.puts into Floki: IO.puts returns :ok, not the body. If you are debugging, use IO.inspect instead, which returns its argument unchanged and can therefore sit in the middle of a pipeline.
  2. Call Floki.parse() first to parse the HTML.
  3. First find the reviews using the .review-entry selector, then map over the results to extract the parts you want.
  4. The div.italic selector was just the first thing I wrote that works to find the date. It looks pretty fragile, so you might want to come up with a better version.
  5. You might want to change Map.new to Enum.map: a map holds one value per key, so if there are multiple reviews on the same date, Map.new keeps only the last one. Changing to Enum.map will give you a list of {date, review} tuples.
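
Points 1 and 5 can be checked without Floki or a network request. The following is a minimal sketch using only the standard library; the sample tuples and the :date and :content keys are made up for illustration and stand in for whatever the scraping step returns.

```elixir
# Hypothetical sample data: two of the three reviews share a date.
pairs = [
  {"June 17, 2021", "First review"},
  {"June 17, 2021", "Second review"},
  {"June 20, 2021", "Third review"}
]

# Map.new/1 keeps one value per key, so the later duplicate overwrites the earlier one.
by_date = Map.new(pairs)
IO.inspect(map_size(by_date), label: "entries in map")

# Enum.map/2 keeps every element. Here each tuple becomes a small map
# with separate :date and :content keys (key names are an assumption).
as_maps = Enum.map(pairs, fn {date, content} -> %{date: date, content: content} end)
IO.inspect(length(as_maps), label: "entries in list")

# IO.puts/1 returns :ok, so anything piped after it receives :ok.
puts_result = IO.puts("hello")

# IO.inspect/1 prints its argument and returns it unchanged.
inspect_result = IO.inspect("hello")
```

If one map per review is what you are after, the Enum.map form above is probably the closer fit, since it never drops a review that shares a date with another.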

Upvotes: 1
