eofficecom com
eofficecom com

Reputation: 27

How to troubleshoot XPath not finding elements on a web page?

I am using a Bash script to search a web page with a specific XPath expression:


# Define variables for the URL and browser
sGDomain="idealista"
sGCitta="fucecchio-firenze"
sGTypo="vendita-case"
iGPagina=1

# Start of the loop
while :; do

    # Build the URL with the iGPagina variable
    url="https://www.$sGDomain.it/$sGTypo/$sGCitta/lista-$iGPagina.htm"
    #echo "$url"
    
    # Get the HTML content of the page
    html_content=$(curl -s -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" "$url")

    echo "$html_content" > htmlcompleto.txt
    
    # Check if the error string is not present in the HTML content
    if [[ ! $html_content =~ "Successiva" ]]; then
        break  # Exit the loop if the error string is not present
    fi
    
    # Use xidel to extract the ads
    xidel_output=$(xidel --silent --xpath '
        //div[contains(@class, "item-info-container")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' -)

    # Check if the temporary file exists and delete it if present
    if [ -f "temp.txt" ]; then
        rm temp.txt
    fi

    # Replace special characters from "desc=" to the end of each line in semi.txt
    echo "$xidel_output" | sed -e "s/desc=\(.*\)\(['\"]\)/desc=\1 /g" > semi.txt

    sed -i 's/\([0-9]\{1,\}\)\.\([0-9]\{1,\}\),[0-9]\{2\}/\1\2/g' semi.txt
    sed -i 's/m²//g' semi.txt

    # Concatenate semi.txt with debugtxt.txt for debugging purposes
    cat semi.txt >> debugtxt.txt

    # Connect to the SQLite database
    db_file="immo.db"
 
    # Loop through the lines and insert them into the SQLite database
    while IFS= read -r line; do
        # Extract price, size, link, and description values from the lines using awk
        prezzo=$(echo "$line" | awk -F 'price=' '{print $2}' | awk -F 'size=' '{print $1}')
        size=$(echo "$line" | awk -F 'size=' '{print $2}' | awk -F 'link=' '{print $1}')
        link=$(echo "$line" | awk -F 'link=' '{print $2}' | awk -F 'desc=' '{print $1}')
        descrizione=$(echo "$line" | awk -F 'desc=' '{print $2}')

        # Determine if the description contains "asta"
        if [[ $descrizione =~ "asta" ]]; then
            asta=1
        else
            asta=0
        fi

        # Insert the data into the SQLite database
        sqlite3 "$db_file" "INSERT INTO $sGDomain (prezzo, link, descrizione, metratura, asta) VALUES ('$prezzo', '$link', '$descrizione', '$size', $asta)"
    done < semi.txt

    # Increment the iGPagina variable for the next iteration**your text**
    ((iGPagina++))
done

Although I believe the XPath is correct, the script fails to find anything on the page.

Web page URL: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm.

XPath expression used:

//div[contains(@class, "item-info-container")] ! string-join(
  (
    ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
    ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
    ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
    ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
  ),
  codepoints-to-string(9)
)

I also tried this XPath expression with //div[contains(@class, "items-container items-list")], but it didn't work either.

Expected result: I expect to extract the price, the listing link, the description, and the square meters from each listing on the web page.

Upvotes: 1

Views: 126

Answers (1)

Reino
Reino

Reputation: 3443

So the essence of your question, your script, comes down to; how can I execute several sqlite3-commands with information extracted from a particular website.

I have no experience with sqlite3, so I can't comment on @Fravadona's JSON solution, but I do know xidel. If you ask me, I think your best bet would be to let xidel create the sqlite3-commands as strings, which you can then execute afterwards. In the process there's no need for curl, awk, etc. Make sure you're using the latest (development) build of xidel though.

$ xidel -s \
  --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" \
  "https://www.idealista.it/vendita-case/fucecchio-firenze/lista-1.htm" \
  -e '
    for $x in //div[contains-token(@class,"item-info-container")]
    let $a:=$x/div[contains-token(@class,"item-description")]/normalize-space(p[@class="ellipsis"])
    return
    $x/`sqlite3 "immo.db" "INSERT INTO idealista (prezzo, link, descrizione, metratura, asta) VALUES ('\''{
      div[@class="price-row"]/span[contains-token(@class,"item-price")]/text()
    }'\'', '\''{
      a[@class="item-link "]/resolve-uri(@href)
    }'\'', '\''{
      $a
    }'\'', '\''{
      tokenize(div[@class="item-detail-char"]/span[contains(text(),"m2")])[1]
    }'\'', '\''{
      if (matches($a,"asta")) then 1 else 0
    }'\'')"`
  ' \
  -f '//li[@class="next"]/a/@href' \
  2>/dev/null
  • I've tested several other more recent user-agent strings, but so far only the one you provided worked. Very peculiar.
  • Put the description in a variable ($a) to check later on if it contains "asta" or not. If "ASTA", or "Asta" is also acceptable, then you can use contains() instead of matches().
  • The string concatenation expression used here is an XPath 4.0 String Template.
  • -f (or --follow) before an extraction-query 'follows' / opens the queried url just once. After an extraction-query it works recursively. In other words, in this case xidel opens 'lista-1.htm', executes the extraction-query, opens 'lista-2.htm', again executes the extraction-query, opens 'lista-3.htm', etc.
  • 2>/dev/null is to silence the message "!!! Recursive follow is deprecated and might be removed soon. !!!" This has come up almost 3 years ago, but so far the feature is still present.
  • See this gist for intermediate steps.

Instead of the XPath 4.0 String Templates, you could also use good old concat():

$ xidel -s \
  --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" \
  "https://www.idealista.it/vendita-case/fucecchio-firenze/lista-1.htm" \
  -e '
    for $x in //div[contains-token(@class,"item-info-container")]
    let $a:=$x/div[contains-token(@class,"item-description")]/normalize-space(p[@class="ellipsis"])
    return
    $x/concat(
      "sqlite3 ""immo.db"" ""INSERT INTO idealista (prezzo, link, descrizione, metratura, asta) VALUES ('\''",
      div[@class="price-row"]/span[contains-token(@class,"item-price")]/text(),
      "'\'', '\''",
      a[@class="item-link "]/resolve-uri(@href),
      "'\'', '\''",
      $a,
      "'\'', '\''",
      tokenize(div[@class="item-detail-char"]/span[contains(text(),"m2")])[1],
      "'\'', '\''",
      if (matches($a,"asta")) then 1 else 0,
      "'\'')"""
    )
  ' \
  -f '//li[@class="next"]/a/@href' \
  2>/dev/null

And finally you could then probably just execute these sqlite3-commands by putting eval "$( ... )" around the entire xidel-command.

Upvotes: 1

Related Questions