Adam Kim

Reputation: 63

Scrape URLs from a web page using HtmlAgilityPack

This is my code so far:

    foreach (var listBoxItem in listBox_google_urls.Items)
    {
        var document = new HtmlWeb().Load(listBoxItem.ToString());
        var files = document.DocumentNode.Descendants("a")
            .Select(a => a.GetAttributeValue("href", ".mp3"))
            .Where(h => h.Contains(".mp3"))
            .ToArray();
        listbox_urls.Items.AddRange(files);
    }

And this is where the entries in listBox_google_urls.Items come from:

    web_search.Navigate("https://www.google.com/search?q=" + val + "+(mp3|wav|ac3|ogg|flac|wma|m4a) -inurl:(jsp|pl|php|html|aspx|htm|cf|shtml) intitle:index.of -inurl:(listen77|mp3raid|mp3toss|mp3drug|index_of|wallywashis)");
    var search_results = this.web_search.Document.Links.Cast<HtmlElement>()
        .Select(a => a.GetAttribute("href"))
        .Where(h => h.Contains("http://"))
        .ToArray();
    listBox_google_urls.Items.AddRange(search_results);

listBoxItem.ToString() output example

The problem is that this method works, but it only scrapes the titles of the links, not the full URLs. Is there a way I can fix it? Thanks in advance.

Upvotes: 1

Views: 125

Answers (1)

Mark Redfern

Reputation: 507

Your code looks good, but I'm not sure why you are defaulting to ".mp3" and then keeping everything that contains ".mp3". You are going to end up with a collection of valid .mp3 URLs plus a whole bunch of bare ".mp3" strings, one for every anchor that has no href. I just hooked into a random Google search page and looked for all URLs with the word "mail" in the href attribute; here are the results:


I hope this answers your question. If you can give me some more info, maybe I can help a little more.

Try this:

        // Load the directory listing and keep only the anchors whose href contains ".mp3".
        var document = new HtmlWeb().Load("http://s1.mymrmusic2.com/hmusic/Album/Foreign%20Albums/VA%20-%20Billboard%20Hot%20100%20(02%20April%202016)/VA%20-%20Billboard%20Hot%20100%20(02%20April%202016)%20%5B320%5D/");
        var files = document.DocumentNode.Descendants("a")
            // An empty default means anchors without an href never pass the filter.
            .Where(a => a.GetAttributeValue("href", string.Empty).Contains(".mp3"))
            .Select(a => new
            {
                Link = a.GetAttributeValue("href", string.Empty),
                // InnerText on the anchor avoids a NullReferenceException when it has no child nodes.
                Text = a.InnerText
            }).ToList();


Or maybe try this option, which prepends the page URL to each link it finds:

    foreach (var listBoxItem in listBox_google_urls.Items)
    {
        var pageUrl = listBoxItem.ToString();
        var document = new HtmlWeb().Load(pageUrl);
        var files = document.DocumentNode.Descendants("a")
            // Default to an empty string so anchors without an href are filtered out.
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(h => h.Contains(".mp3"))
            // Assumes pageUrl ends with "/" and the href is relative to it.
            .Select(h => pageUrl + h)
            .ToArray();
        listbox_urls.Items.AddRange(files);
    }
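
Plain string concatenation only works when the page URL ends with "/" and every href is relative. As a rough sketch of a more forgiving variant, the hrefs could be resolved against the page URL with System.Uri instead. The `LinkResolver` class and `ResolveMp3Links` method are hypothetical names made up for this illustration, and the example.com addresses in the usage note are placeholders:

```csharp
using System;
using System.Collections.Generic;

static class LinkResolver
{
    // Hypothetical helper: turns the href values found on a page into absolute .mp3 URLs.
    public static string[] ResolveMp3Links(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl);
        var results = new List<string>();

        foreach (var href in hrefs)
        {
            // Skip missing hrefs and anything that is not an .mp3 link.
            if (string.IsNullOrEmpty(href) || !href.Contains(".mp3"))
                continue;

            // Resolves relative hrefs against the page URL and leaves absolute ones alone;
            // hrefs that cannot be parsed are skipped instead of throwing.
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
                results.Add(absolute.AbsoluteUri);
        }

        return results.ToArray();
    }
}
```

Inside the loop above, this would presumably be called as `LinkResolver.ResolveMp3Links(listBoxItem.ToString(), document.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", string.Empty)))`, with the result passed to `listbox_urls.Items.AddRange`. Note that `AbsoluteUri` returns the escaped form of the URL, so a space in a file name comes back as `%20`.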

Upvotes: 1
