Reputation: 365
I want to collect user names from member-list pages like this: http://www.marksdailyapple.com/forum/memberslist/
I want to get every username from all the pages, and I want to do this on Linux, with bash.
Where should I start? Could anyone give me some tips?
Upvotes: 0
Views: 3527
Reputation: 370
As Robin suggests, you should really do this kind of stuff in a programming language with a decent HTML parser. You can always use command-line tools for various tasks, but in this case I would probably have chosen Perl.
If you really want to try it with command-line tools, I would suggest curl, grep, sort and sed.
I always find it easier when I have something to play with, so here's something to get you started.
I would not use this kind of code to produce anything serious, but it should give you some ideas.
The member pages seem to be xxx://xxx.xxx/index1.html, where the 1 indicates the page number. So the first thing I would do is extract the number of the last member page. Once I have that, I know which URLs to feed to curl.
Every username sits in an anchor with the class "username"; with that information we can use grep to pull out the relevant data.
#!/bin/bash
# Number of member-list pages to fetch (hardcoded here; see the note below)
number_of_pages=2
# Let curl expand index[1-N].html itself, grep out the username anchors,
# strip the surrounding markup with sed, and sort the result
curl http://www.marksdailyapple.com/forum/memberslist/index[1-${number_of_pages}].html --silent | egrep 'class="username">.*</a>' -o | sed 's/.*>\(.*\)<\/a>/\1/' | sort
The idea here is to give curl the addresses in the format index[1-XXXX].html, which makes curl traverse all the pages by itself. We then grep for the username class and pass the matches to sed to extract the relevant data (the username). Finally, the produced username list is passed to sort so the usernames come out sorted. I always like sorted things ;)
One big note though: the script hardcodes number_of_pages instead of detecting the last page as described above, so adjust it to the real page count before running it; a sketch of how to derive it automatically follows.
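For completeness, here is one way you could derive number_of_pages automatically rather than hardcoding it. This is only a sketch and assumes the pagination links contain hrefs of the form indexN.html; adjust the grep pattern if the markup differs.
#!/bin/bash
# Guess the last page number from the pagination links on the first page.
# Assumes the hrefs look like "indexN.html"; tweak the pattern otherwise.
number_of_pages=$(curl --silent http://www.marksdailyapple.com/forum/memberslist/ \
    | grep -o 'index[0-9]*\.html' \
    | sed 's/[^0-9]//g' \
    | sort -n \
    | tail -1)
echo "last page appears to be ${number_of_pages}"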
Hope that helps!
Upvotes: 2
Reputation: 16927
This is what my Xidel was made for:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f '(//a[@rel="Next"])[1]'
With that simple line it will parse the pages with a proper HTML parser, use CSS selectors to find all links with names, use XPath to find the next page, and repeat until all pages are processed.
You can also write it using only css selectors:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f 'div#pagination_top span.prev_next a'
Or with pattern matching: there you basically just copy the HTML elements you want to find from the page source and replace the text content with {.}:
xidel http://www.marksdailyapple.com/forum/memberslist/ -e '<a class="username">{.}</a>*' -f '<a rel="next">{.}</a>'
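Whichever variant you use, the names simply come out on standard output, so ordinary shell redirection is enough to collect them into a file; the sort -u below is plain shell deduplication, not a Xidel feature.
xidel http://www.marksdailyapple.com/forum/memberslist/ -e 'a.username' -f '(//a[@rel="Next"])[1]' | sort -u > usernames.txt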
Upvotes: 7
Reputation: 8819
I used this bash script to go through all the pages:
#!/bin/bash
IFS=$'\n'
url="http://www.marksdailyapple.com/forum/memberslist/"

# Fetch the first page and pull the highest indexN out of the "Last Page" link
content=$(curl --silent -L ${url} 2>/dev/null | col -b)
pages=$(echo ${content} | sed -n '/Last Page/s/^.*index\([0-9]\+\).*/\1/p' | head -1)

for page in $(seq ${pages}); do
    IFS=
    # Fetch page N and keep whatever sits between class="username"> and the next <
    content=$(curl --silent -L ${url}index${page}.html 2>/dev/null | col -b)
    patterns=$(echo ${content} | sed -n 's/^.*class="username">\([^<]*\)<.*$/\1/gp')
    IFS=$'\n' users=(${patterns})
    for user in "${users[@]}"; do
        echo "user=${user}."
    done
done
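If you want a clean, deduplicated list instead of the user=... echo lines, you can post-process the output with plain shell tools; collect_users.sh is just a placeholder name for the script above.
# Strip the "user=" prefix and the trailing dot, then dedupe and save
./collect_users.sh | sed 's/^user=//; s/\.$//' | sort -u > usernames.txt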
Upvotes: 1
Reputation: 33083
First you should use wget to get all the username pages. You will have to use some options (check the man page for wget) to make it follow the right links, and ideally not follow any of the uninteresting links (or, failing that, you can just ignore the uninteresting links afterwards).
Then, despite the fact that Stack Overflow tells you not to use regular expressions to parse HTML, you should use regular expressions to parse HTML, because it's only a homework assignment, right?
If it's not a homework assignment, you've not chosen the best tool for the job.
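If you do go this route, a rough sketch might look like the following. It is untested against this site, the --accept-regex pattern is only an assumption about how the member-list URLs are laid out, and it needs a reasonably recent GNU wget.
# Mirror only the member-list index pages, then regex the usernames out of them
wget --recursive --level=1 --no-parent --no-directories \
     --accept-regex 'memberslist/index[0-9]+\.html' \
     http://www.marksdailyapple.com/forum/memberslist/
grep -oh 'class="username">[^<]*' index*.html | sed 's/.*>//' | sort -u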
Upvotes: 2