Okashi

Reputation: 1

HtmlAgilityPack full page loading

So, I have code which downloads pictures from parsed links. Downloading and parsing work well, but I have a problem with loading the full content of the page.

/*
 * https://shikimori.org/animes/38256-magia-record-mahou-shoujo-madoka-magica-gaiden-tv
 * For testing
 */

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading;
using HtmlAgilityPack;

class Program
{
    static string root = @"C:\Shikimori\";
    static List<string> sources = new List<string>();

    [STAThread]
    static void Main(string[] args)
    {
        Console.Write("Enter link: ");
        var link = Console.ReadLine(); // enter the link from above
        link += "/art";

        var web = new HtmlWeb();
        web.BrowserTimeout = TimeSpan.FromTicks(0);

        var htmlDocument = new HtmlDocument();

        Thread.Sleep(3000);

        try
        {
            htmlDocument = web.LoadFromBrowser(link); // roughly once per 30 minutes this loads almost the whole page with all pictures
        }
        catch
        {
            Console.WriteLine("An error has occurred.");
        }

        Thread.Sleep(3000);

        var name = htmlDocument.DocumentNode.Descendants("div")
            .Where(node => node.GetAttributeValue("class", "")
            .Equals("b-options-floated mobile-phone_portrait r-edit")).ToList();

        //var divlink = htmlDocument.DocumentNode.Descendants("div")
        //    .Where(node => node.GetAttributeValue("class", "")
        //    .Equals("container packery")).ToList();

        var alink = htmlDocument.DocumentNode.Descendants("a")
            .Where(node => node.GetAttributeValue("class", "")
            .Equals("b-image")).ToList();

        foreach(var a in alink)
        {
            sources.Add(a.GetAttributeValue("href", string.Empty));
        }

        var tmp = Regex.Replace(name[0].GetDirectInnerText(), "[^a-zA-Z0-9._]", string.Empty);

        root += (tmp+"\\");

        if (!Directory.Exists(root))
        {
            Directory.CreateDirectory(root);
        }

        for (int i = 0; i < sources.Count; i++)
        {
            using (WebClient client = new WebClient())
            {
                var test = sources[i].Split(';').Last().Replace("url=", string.Empty);
                try
                {
                    client.DownloadFile(new Uri(test), root + test.Split('/').Last().Replace("&amp", string.Empty).Replace("?", string.Empty));
                    Console.WriteLine($"Image #{i + 1} downloaded successfully!");
                }
                catch
                {
                    Console.WriteLine($"Image #{i + 1} failed to download...");
                }
            }
        }

        Thread.Sleep(3000);

        Console.WriteLine("Done!");
        Console.ReadKey();

    }
}
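
As a side note, `WebClient` is considered obsolete in newer .NET versions. The download loop above could use `HttpClient` instead; here is a minimal sketch (using only the standard `System.Net.Http` API, with a hypothetical `DownloadImageAsync` helper name):

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class Downloader
{
    // Reuse a single HttpClient instance for all downloads,
    // as recommended by the .NET documentation.
    static readonly HttpClient client = new HttpClient();

    // Fetches the image at the given URL and writes it to disk.
    static async Task DownloadImageAsync(string url, string path)
    {
        byte[] data = await client.GetByteArrayAsync(url);
        await File.WriteAllBytesAsync(path, data);
    }
}
```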

The issue is: it works maybe once per 30 minutes, I guess? And even then not as well as I expected. The HTML parser does not load the content fully. If the link has 100+ pictures, at best I get something like 5 to 15 of them. If the link (for example: https://shikimori.one/animes/1577-taiho-shichau-zo) has around 30 pictures, it can likely parse them all. (Other options not tested. I also tried to parse Google Images; it only loads the first page of results, never reaching the "More results" button.)

I assume that the site is protected from bots, and therefore it does not always respond to requests from my program, or something like that. Another person, as I understand, has the same problem, but there is still no answer. How can this be fixed?
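
Since the gallery appears to be lazy-loaded by JavaScript as you scroll, one common workaround is to drive a real browser instead of relying on `HtmlWeb.LoadFromBrowser`. Below is a rough sketch assuming the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages; the scroll loop and the 2-second delay are assumptions, not values tested against shikimori.org:

```csharp
using System.Threading;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class BrowserLoader
{
    // Scrolls to the bottom repeatedly so lazy-loaded images get a chance
    // to appear, then hands the final HTML to HtmlAgilityPack.
    static HtmlDocument LoadFullPage(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            long lastHeight = 0;
            while (true)
            {
                ((IJavaScriptExecutor)driver).ExecuteScript(
                    "window.scrollTo(0, document.body.scrollHeight);");
                Thread.Sleep(2000); // give lazy loading time to fire

                long newHeight = (long)((IJavaScriptExecutor)driver).ExecuteScript(
                    "return document.body.scrollHeight;");
                if (newHeight == lastHeight)
                    break; // no new content appeared; assume the page is complete
                lastHeight = newHeight;
            }

            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
    }
}
```

The returned `HtmlDocument` can then be queried with the same `Descendants("a")` / `GetAttributeValue` calls as in the original code.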

Upvotes: 0

Views: 665

Answers (0)
