Reputation: 31

Crawling all Wikipedia pages for phrases in Python

I need to design a program that finds certain four- or five-word phrases across the entire Wikipedia collection of articles (yes, I know it's a lot of pages, and I don't need answers calling me an idiot for doing this).

I haven't programmed much stuff like this before, so there are two issues that I would greatly appreciate some help with:

First, how can I get the program to go through all of the pages (i.e., without hardcoding each one of the millions of pages)? I have downloaded all the articles onto my hard drive, but I'm not sure how to tell the program to iterate through each one in the folder. EDIT - I have all the Wikipedia articles on my hard drive.
Second, the snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?

Your help on either of the issues is greatly appreciated!

Upvotes: 3

Answers (4)

Jeff Tratner

Reputation: 17106

The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?

If you are okay with finding the phrases within the tables, you could try using regular expressions directly, but the better choice would be to use a parser and remove all the markup. You could use Beautiful Soup to do this (you will need lxml too):

from bs4 import BeautifulSoup
# markup_from_file holds the contents of one article file, read beforehand
# stripped_strings is a generator that yields the text of each tag in turn
gen = BeautifulSoup(markup_from_file, 'xml').stripped_strings
list_of_strings = list(gen)  # materialize the generator as a list
text = ' '.join(list_of_strings)  # article text as a single string

Beautiful Soup produces Unicode text, so if you need encoded bytes instead, you can just do:

list_of_strings = [s.encode('utf-8') for s in list_of_strings]

Plus, Beautiful Soup can help you to better navigate and select from each document. If you know the encoding of the data dump, that will definitely help it go faster. The author also says that it runs faster on Python 3.
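
If installing Beautiful Soup is not an option, the same idea (drop the markup, keep the text, skip tables) can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not Beautiful Soup, and the names MainTextExtractor, SKIP and extract_text are made up for illustration. It assumes the snapshots are HTML and that tables, scripts and styles are the parts to ignore; images carry no text, so they drop out on their own.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text while ignoring anything inside the tags listed in SKIP."""

    # Assumption: these are the elements to drop; adjust for your snapshots.
    SKIP = {"table", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []  # pieces of text kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_text on the contents of one article file gives a single string that can then be searched for the phrases.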

Upvotes: 2

scorpiodawg

Reputation: 5752

You asked:

I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder

Assuming all the files are in a directory tree structure, you could use os.walk (see the Python documentation for an example) to visit every file and then search each file for the phrase(s) using something like:

with open("filename") as f:  # the file is closed automatically
    for line in f:
        if "search_string" in line:
            print(line)

Of course, this solution won't be featured on the cover of "Python Perf" magazine, but I'm new to Python so I'll pull the n00b card. There is likely a better way to grep within a file using Python's pre-baked modules.
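
Putting the two pieces together, here is a rough sketch of how os.walk and the search loop might combine. The function name find_phrases and its arguments are hypothetical, and it assumes the files are UTF-8 text (undecodable bytes are replaced rather than raising an error). It reads each file whole and collapses whitespace, so a phrase that wraps across a line break is still found.

```python
import os


def find_phrases(root_dir, phrases):
    """Yield (path, phrase) for every phrase found in a file under root_dir."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: files are UTF-8 text; bad bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as f:
                # Collapse runs of whitespace (including newlines) to one space.
                text = " ".join(f.read().split())
            for phrase in phrases:
                if phrase in text:
                    yield path, phrase
```

Because it is a generator, you can loop over find_phrases(folder, phrases) and print or store matches as they come in, without holding results for millions of files in memory.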

Upvotes: 0

SingleNegationElimination

Reputation: 156188

Point 1: Python has a function just for the task of recursively iterating over every file and directory under a path: os.walk.

Point 2: What you seem to be asking here is how to distinguish files that are images from files that are text. The magic module, available at the Cheese Shop (PyPI), provides Python bindings for the standard Unix utility of the same name (usually invoked as file(1)).
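
As a rough standard-library stand-in for the idea in point 2, the sketch below combines os.walk with mimetypes.guess_type. Note that this is not the magic module: mimetypes guesses from the file extension only and never looks at the file contents, so it is only useful if the downloaded files kept sensible extensions. The function name text_like_files is made up for this example.

```python
import mimetypes
import os


def text_like_files(root_dir):
    """Yield paths under root_dir whose extension suggests text or HTML."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            mime, _encoding = mimetypes.guess_type(name)
            # guess_type returns None for unknown extensions; skip those too.
            if mime is not None and mime.startswith("text/"):
                yield os.path.join(dirpath, name)
```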

Upvotes: 0

Kien Truong

Reputation: 11381

Instead of crawling pages manually, which is slower and can get you blocked, you should download the official data dump. It doesn't contain images, so the second problem is also solved.

EDIT: I see that you have all the articles on your computer, so this answer might not help much.
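
For anyone who does start from the dump, here is a minimal sketch of streaming through it with the standard library. It assumes the usual MediaWiki XML export layout (page elements that each hold a title and a revision with a text element) and one revision per page, as in the current-articles dump. The helper names local_name and iter_pages are hypothetical. The text you get back is wiki markup, not clean prose, so it may still need stripping before you search it.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each page in a MediaWiki XML export.

    source may be a filename or a binary file object.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's subtree
```

Since iterparse accepts a file object, a compressed dump can probably be read without unpacking it first by passing bz2.open(path, "rb") as source.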

Upvotes: 6
