Seomat

Reputation: 143

Go: How to download a web page and convert the relative links in its HTML to absolute ones

From this:

<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <img src="img.jpg" alt="" width="500" height="600">

I want to get this:

<head>
  <link rel="stylesheet" href="http://bbc.com/styles.css">
</head>
<body>
  <img src="http://bbc.com/img.jpg" alt="" width="500" height="600">

When I download a page, its links to CSS, images, and so on are relative. How can I convert the HTML while downloading so that every link is absolute instead? I download the page with the code from this answer (How to get webpage content into a string using Go):

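package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {

    s := OnPage("http://bbc.com/")

    // Print, not Printf: the HTML may contain '%' characters.
    fmt.Print(s)
}

// OnPage downloads link and returns the response body as a string.
func OnPage(link string) string {
    res, err := http.Get(link)
    if err != nil {
        log.Fatal(err)
    }
    content, err := ioutil.ReadAll(res.Body)
    res.Body.Close()
    if err != nil {
        log.Fatal(err)
    }
    return string(content)
}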

Upvotes: 0

Views: 1573

Answers (2)

Riaz Mahmud

Reputation: 119

I built a PHP package (installed with Composer) for downloading content from any URL, including images, CSS, JS, and video.

Check it out: https://github.com/Riaz-Mahmud/Websitebackup

Installation

composer require backdoor/websitebackup

Usage

use Backdoor\WebsiteBackup\WebsiteBackup;

function siteBackup(){

    $url = 'link to your website page to backup';
    $path = 'path to save backup file';

    $websiteBackup = new WebsiteBackup();
    $backup = $websiteBackup->backup($url, $path);

}

Upvotes: 1

Nikita Petrov

Reputation: 59

You can use regular expressions to rewrite the relevant parts of the HTML string. Here is one way to do it (this assumes every link on the page is relative; if some are already absolute, adjust the code accordingly):

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "regexp"
)

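func main() {

    s := OnPage("http://bbc.com/")

    // Print, not Printf: the HTML may contain '%' characters.
    fmt.Print(s)
}

// OnPage downloads link and prefixes the src of every <img> and the href of
// every <link> with link. It assumes all of those values are relative.
func OnPage(link string) string {
    res, err := http.Get(link)
    if err != nil {
        log.Fatal(err)
    }
    content, err := ioutil.ReadAll(res.Body)
    res.Body.Close()
    if err != nil {
        log.Fatal(err)
    }
    html := string(content)

    // Rewrite <img ... src="...">.
    re := regexp.MustCompile(`(<img[^>]+src)="([^"]+)"`)
    updatedHTML := re.ReplaceAllString(html, `$1="`+link+`$2"`)

    // Rewrite <link ... href="...">. Start from updatedHTML, not html,
    // so the <img> replacements above are not thrown away.
    re = regexp.MustCompile(`(<link[^>]+href)="([^"]+)"`)
    updatedHTML = re.ReplaceAllString(updatedHTML, `$1="`+link+`$2"`)
    return updatedHTML
}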
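
If some links on the page are already absolute, or start with a slash, plain string concatenation produces broken URLs. One possible variation is to resolve each value against the page URL with url.ResolveReference from the standard net/url package. The sketch below is only an illustration: the function name absolutize and the attribute pattern are made up for this example, it only handles double-quoted src and href values, and a real HTML parser would be more robust than a regular expression.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// attrRe matches a whitespace-preceded src="..." or href="..." attribute.
// Assumption: attribute values are double-quoted.
var attrRe = regexp.MustCompile(`(\s(?:src|href))="([^"]*)"`)

// absolutize (hypothetical helper) rewrites every src/href value in html
// to an absolute URL resolved against base. Values that cannot be parsed
// as URLs are left untouched.
func absolutize(html, base string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	out := attrRe.ReplaceAllStringFunc(html, func(m string) string {
		parts := attrRe.FindStringSubmatch(m)
		ref, err := url.Parse(parts[2])
		if err != nil {
			return m
		}
		// ResolveReference leaves absolute URLs as they are and
		// resolves relative ones against baseURL.
		return parts[1] + `="` + baseURL.ResolveReference(ref).String() + `"`
	})
	return out, nil
}

func main() {
	page := `<link rel="stylesheet" href="styles.css"><img src="/img.jpg">`
	s, err := absolutize(page, "http://bbc.com/")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(s)
}
```

In OnPage you would call absolutize(string(content), link) instead of the two ReplaceAllString calls. Links that already carry a scheme and host, such as https://example.org/x, pass through unchanged.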

Upvotes: 1
