Reputation: 365
I want to collect user names from member-list pages like this: http://www.marksdailyapple.com/forum/memberslist/
I want to get every username from all the pages, and I want to do this on Linux, with bash.
Where should I start? Could anyone give me some tips?
Upvotes: 0
Views: 3527
Reputation: 370
As Robin suggests, you should really do this kind of stuff in a programming language with a decent HTML parser. You can always use command-line tools for various tasks, but in this case I would probably have chosen Perl.
If you really want to try it with command-line tools, I would suggest curl, grep, sort and sed.
I always find it easier when I have something to play with, so here's something to get you started.
I would not use this kind of code to produce anything serious, but it should give you some ideas.
The member pages seem to be xxx://xxx.xxx/index1.html, where the 1 indicates the page number. So the first thing I would do is extract the number of the last member page. Once I have that, I know which URLs to feed to curl.
Every username sits in an anchor with the class "username"; with that information we can use grep to pull out the relevant data.
#!/bin/bash
# Number of member-list pages to fetch (hardcoded here; see the note below)
number_of_pages=2
# Let curl expand index[1-N].html itself, grep out the username anchors,
# strip the surrounding markup with sed, and sort the result
curl http://www.marksdailyapple.com/forum/memberslist/index[1-${number_of_pages}].html --silent | egrep 'class="username">.*</a>' -o | sed 's/.*>\(.*\)<\/a>/\1/' | sort
The idea here is to give curl the addresses in the format index[1-XXXX].html, which makes curl traverse all the pages by itself. We then grep for the username class and pass the matches to sed to extract the relevant data (the username). Finally, the produced username list is passed to sort so the usernames come out sorted. I always like sorted things ;)
One big note though: the script hardcodes number_of_pages instead of detecting the last page as described above, so adjust it to the real page count before running it; a sketch of how to derive it automatically follows.
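For completeness, here is one way you could derive number_of_pages automatically rather than hardcoding it. This is only a sketch and assumes the pagination links contain hrefs of the form indexN.html; adjust the grep pattern if the markup differs.
#!/bin/bash
# Guess the last page number from the pagination links on the first page.
# Assumes the hrefs look like "indexN.html"; tweak the pattern otherwise.
number_of_pages=$(curl --silent http://www.marksdailyapple.com/forum/memberslist/ \
    | grep -o 'index[0-9]*\.html' \
    | sed 's/[^0-9]//g' \
    | sort -n \
    | tail -1)
echo "last page appears to be ${number_of_pages}"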
Hope that helps!
Upvotes: 2
Reputation: 16927
This is what my Xidel was made for:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f '(//a[@rel="Next"])[1]'
With that simple line it will parse the pages with a proper HTML parser, use CSS selectors to find all links with names, use XPath to find the next page, and repeat until all pages are processed.
You can also write it using only css selectors:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f 'div#pagination_top span.prev_next a'
Or with pattern matching: there you basically just copy the HTML elements you want to find from the page source and replace the text content with {.}:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e '<a class="username">{.}</a>*' -f '<a rel="next">{.}</a>'
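Whichever variant you use, the names simply come out on standard output, so ordinary shell redirection is enough to collect them into a file; the sort -u below is plain shell deduplication, not a Xidel feature.
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f '(//a[@rel="Next"])[1]' | sort -u > usernames.txt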
Upvotes: 7
Reputation: 8819
I used this bash script to go through all the pages:
#!/bin/bash
IFS=$'\n'
url="http://www.marksdailyapple.com/forum/memberslist/"

# Fetch the first page and pull the highest indexN out of the "Last Page" link
content=$(curl --silent -L ${url} 2>/dev/null | col -b)
pages=$(echo ${content} | sed -n '/Last Page/s/^.*index\([0-9]\+\).*/\1/p' | head -1)

for page in $(seq ${pages}); do
    IFS=
    # Fetch page N and keep whatever sits between class="username"> and the next <
    content=$(curl --silent -L ${url}index${page}.html 2>/dev/null | col -b)
    patterns=$(echo ${content} | sed -n 's/^.*class="username">\([^<]*\)<.*$/\1/gp')
    IFS=$'\n' users=(${patterns})
    for user in "${users[@]}"; do
        echo "user=${user}."
    done
done
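If you want a clean, deduplicated list instead of the user=... echo lines, you can post-process the output with plain shell tools; collect_users.sh is just a placeholder name for the script above.
# Strip the "user=" prefix and the trailing dot, then dedupe and save
./collect_users.sh | sed 's/^user=//; s/\.$//' | sort -u > usernames.txt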
Upvotes: 1
Reputation: 33083
First you should use wget to get all the username pages. You will have to use some options (check the man page for wget) to make it follow the right links, and ideally not follow any of the uninteresting links (or, failing that, you can just ignore the uninteresting links afterwards).
Then, despite the fact that Stack Overflow tells you not to use regular expressions to parse HTML, you should use regular expressions to parse HTML, because it's only a homework assignment, right?
If it's not a homework assignment, you've not chosen the best tool for the job.
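If you do go this route, a rough sketch might look like the following. It is untested against this site, the --accept-regex pattern is only an assumption about how the member-list URLs are laid out, and it needs a reasonably recent GNU wget.
# Mirror only the member-list index pages, then regex the usernames out of them
wget --recursive --level=1 --no-parent --no-directories \
     --accept-regex 'memberslist/index[0-9]+\.html' \
     http://www.marksdailyapple.com/forum/memberslist/
grep -oh 'class="username">[^<]*' index*.html | sed 's/.*>//' | sort -u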
Upvotes: 2