Reputation: 29452
I want to crawl a website and store the content on my computer for later analysis. However, my OS file system has a limit on the number of subdirectories, so mirroring the site's original folder structure is not going to work.
Suggestions?
Map each URL to some filename so I can store everything flatly? Or just shove it all into a database like SQLite to avoid the file system limitations?
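For the flat-mapping idea, something like this rough sketch is what I had in mind (the directory name is just a placeholder):

```python
# Rough sketch: percent-encode the whole URL so it becomes one flat,
# reversible filename (subject to the file system's name-length limit).
from pathlib import Path
from urllib.parse import quote, unquote

STORE = Path("pages")
STORE.mkdir(exist_ok=True)

def url_to_path(url: str) -> Path:
    # safe="" also escapes "/", so no subdirectories are ever created
    return STORE / quote(url, safe="")

def path_to_url(path: Path) -> str:
    return unquote(path.name)
```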
Upvotes: 3
Views: 3133
Reputation: 75185
It all depends on the effective amount of text and/or web pages you intend to crawl. A generic solution is probably to use a DBMS of some kind to keep track of the pages' URLs and metadata, and to store the pages' content itself as flat files on the file system, referenced from the database.
The advantage of this approach is that the DBMS remains small but is available for SQL-driven queries (ad hoc or programmed) to search on various criteria. There is typically little gain (and a lot of headache) in storing many/big files within the SQL server itself. Furthermore, as each page gets processed/analyzed, additional metadata (such as, say, the title, the language, the five most repeated words, whatever) can be added to the database.
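As a minimal sketch of that split, assuming SQLite for the metadata side (the table and file names here are illustrative, not prescriptive):

```python
import hashlib
import sqlite3
from pathlib import Path

PAGES = Path("pages")
PAGES.mkdir(exist_ok=True)

db = sqlite3.connect("crawl.db")
db.execute("""CREATE TABLE IF NOT EXISTS pages (
                  url      TEXT PRIMARY KEY,   -- crawled URL
                  path     TEXT NOT NULL,      -- flat file holding the body
                  title    TEXT,               -- metadata filled in during analysis
                  language TEXT
              )""")

def store_page(url, html, title=None, language=None):
    # The body goes to a flat file; only the small metadata row goes to SQLite.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    (PAGES / name).write_bytes(html)
    db.execute(
        "INSERT OR REPLACE INTO pages (url, path, title, language) VALUES (?, ?, ?, ?)",
        (url, str(PAGES / name), title, language),
    )
    db.commit()
```

Ad-hoc queries then run against the small database only, e.g. `SELECT url, path FROM pages WHERE language = 'en'`, while the bulky HTML stays out on disk.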
Upvotes: 5
Reputation: 6964
Depending on the processing power of the PC that will do the data mining, you could add the scraped data to a compressed archive such as a 7z, zip, or tarball. You'll be able to keep the directory structure intact and may end up saving a great deal of disk space, if that happens to be a concern.
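A rough sketch of that idea, assuming a gzipped tarball and using the URL's host and path as the member name (names are illustrative only):

```python
import io
import tarfile
from urllib.parse import urlparse

archive = tarfile.open("crawl.tar.gz", "w:gz")

def add_page(url, html):
    # Keep the site's directory structure as the member name inside the tarball,
    # so the hierarchy exists in the archive rather than on the file system.
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    info = tarfile.TarInfo(name=parts.netloc + path)
    info.size = len(html)
    archive.addfile(info, io.BytesIO(html))
    # remember archive.close() when the crawl finishes
```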
On the other hand, an RDBMS like SQLite will balloon in size really fast but won't mind ridiculously long directory hierarchies.
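If you do go the SQLite route, a minimal sketch might look like this (the schema is just an example), with the full URL, however deep, serving as a plain text key:

```python
import sqlite3

db = sqlite3.connect("crawl.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)")

def save_page(url, html):
    # No directory limits apply: the whole URL path is just a string here.
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, html))
    db.commit()
```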
Upvotes: 1
Reputation: 29145
Having it in a database will help you search through the content and page metadata. You can also try an in-memory database or memcached-like storage to speed things up.
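For instance, a quick sketch using SQLite's in-memory mode as a stand-in for a memcached-style cache (the schema is illustrative):

```python
import sqlite3

cache = sqlite3.connect(":memory:")  # lives entirely in RAM
cache.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)")

def remember(url, title, body):
    cache.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, title, body))

def lookup(url):
    # Fast repeated lookups during analysis; persist separately if the
    # results need to outlive the process.
    return cache.execute("SELECT title, body FROM pages WHERE url = ?", (url,)).fetchone()
```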
Upvotes: 1