Rasmus Olsen

Reputation: 11

Get all email addresses from website to csv

I need to extract all email addresses from this website: http://www.danskeark.dk/Medlemsindex.aspx. To reach the addresses, open each letter of the index (A, B, C, D, ...) and then each company page.

I also need to export the addresses I find to Excel.

What is the easiest way to do that?

Upvotes: 1

Views: 177

Answers (2)

blackholyman

Reputation: 1450

Here is a little crawler written in AutoHotkey (AHK), a free, open-source scripting language for Windows.

You will need to download and install AutoHotkey to run it.

I used a visible IE object so you can watch what it is doing. That makes it a bit slow (5-7 minutes), but that hardly matters if you only need it once.

url := "http://www.danskeark.dk/Medlemsindex.aspx"

wb := ComObjCreate("InternetExplorer.Application")
wb.visible := true

virksomheds_Urls := []
chars := "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ0123456789"
loop, parse, chars
{
    index := "?index=" A_LoopField
    wb.Navigate(url . index)
    while wb.readyState!=4 || wb.document.readyState != "complete" || wb.busy
        continue
    pages := wb.document.getElementById("pagesTop").getElementsByTagName("A").length - 1
    loop % pages
    {
        wb.Navigate(url . index . "&pg=" A_index)
        while wb.readyState!=4 || wb.document.readyState != "complete" || wb.busy
            continue
        loop % (links := wb.document.getElementsByTagName("UL")[1].getElementsByTagName("A")).length
        {
            virksomheds_Urls.Insert(links[A_index-1].href)
        }
    }
}
for, key, val in virksomheds_Urls
{
    wb.Navigate(val)
    while wb.readyState!=4 || wb.document.readyState != "complete" || wb.busy
        continue
    csv .= (Email := wb.document.getElementById("divContactBox").GetelementsByTagName("A")[0].innertext) ","
}
FileAppend, %csv%, Emails_csv.csv
run, excel.exe Emails_csv.csv
return
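The script above writes every address into a single comma-separated row, so Excel shows them as one long line of columns. If you would rather have one address per row, a small post-processing step can split that row. This is only a sketch: it assumes a POSIX shell is available (for example via Git Bash or WSL), the sample addresses are made up, and `Emails_rows.csv` is a hypothetical output name.

```shell
#!/bin/sh
# Hypothetical sample of what the crawler writes: one row with a trailing comma.
workdir=$(mktemp -d)
printf '%s' 'info@example.dk,post@example.dk,kontakt@example.dk,' > "$workdir/Emails_csv.csv"

# Turn every comma into a newline, then drop any blank lines
# (blank lines appear when a company page had no address).
tr ',' '\n' < "$workdir/Emails_csv.csv" \
  | sed '/^[[:space:]]*$/d' > "$workdir/Emails_rows.csv"

cat "$workdir/Emails_rows.csv"
```

Point the `tr` command at the real `Emails_csv.csv` instead of the sample to use it on the crawler's output.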

Upvotes: 0

overflowed

Reputation: 1838

Mirror the site with wget in a new directory:

wget -mk --domains danskeark.dk danskeark.dk

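Then, from inside that directory, grep all email addresses out to a CSV file in the parent directory (writing to the parent keeps grep from matching its own output). The -type f and -print0/-0 options skip directories and cope with spaces in file names, and -h keeps the file names out of the results:

find . -type f -print0 | xargs -0 grep -E -o -h "\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" > ../out.csv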
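A mirrored site usually repeats the same address on many pages, so the result tends to need cleaning before it goes into Excel. The following is a minimal sketch of that cleanup, not part of the original answer: it strips any `path:` prefix that grep may add when it searches several files, lowercases the addresses, and removes duplicates. The sample lines and the `emails.csv` name are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample standing in for ../out.csv; the addresses are made up.
workdir=$(mktemp -d)
cat > "$workdir/out.csv" <<'EOF'
./a/index.html:info@example.dk
./b/page.html:Info@Example.dk
post@example.dk
./c/x.html:info@example.dk
EOF

# Strip everything up to the last colon (email addresses contain none),
# lowercase the result, then sort and drop duplicates.
sed 's/^.*://' "$workdir/out.csv" \
  | tr 'A-Z' 'a-z' \
  | sort -u > "$workdir/emails.csv"

cat "$workdir/emails.csv"
```

With one address per line, the file opens in Excel as a single column.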

Upvotes: 2
