oldboy
oldboy

Reputation: 5964

Database Structure, Storing and Organizing Data

I'm creating a project that tracks crowd-funding campaigns, so that the end user can analyze the data. Obviously I'm using a spider to scrape all of the details of each campaign at regular intervals, which I will then store in a database.

I'm just not sure how I should design the database. The issue is that campaigns can have lifespans of over a month, and I may be scraping each campaign multiple times a day to check for changes.

Lumping every campaign into a single table would be impractical since there will be thousands of campaigns and, theoretically, a single campaign might have hundreds and hundreds of rows if its details are constantly being updated. There will likely be tens and tens of columns too. Hence I've considered creating separate tables for each campaign.

At the same time, having thousands of tables also seems impractical, especially if the user wanted to compare a handful of different campaigns. In order to compare many campaigns, I would then have to query an indefinite number of tables.

I've never dealt with such complexity before. Does anybody have any idea how to approach this problem?

Potential Fields

CREATE TABLE campaign (
  id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  campaign_url VARCHAR(255) NOT NULL,
  campaign_phase VARCHAR(8) NOT NULL,
  project_website VARCHAR(255) NOT NULL,
  project_facebook_url VARCHAR(255) NULL,
  project_linkedin_url VARCHAR (255) NULL,
  project_twitter_url VARCHAR(255) NULL,
  project_youtube_url VARCHAR(255) NULL,
  product_title TEXT NOT NULL,
  product_tagline TEXT NOT NULL,
  product_phase VARCHAR(10) NULL,
  product_overview TEXT NULL, # may be more columns derived from overview...
  owner_name VARCHAR(255) NOT NULL,
  owner_title VARCHAR(255) NOT NULL,
  owner_description TEXT NULL,
  owner_avatar_url VARCHAR(255) NULL,
  owner_location VARCHAR(255) NOT NULL,
  owner_campaign_count TINYINT NOT NULL,
  owner_total_raised INT NOT NULL,
  owner_other_campaign_urls TEXT NOT NULL, # this may have multiple values...
  owner_contribution_count TINYINT NOT NULL,
  owner_verified BIT NULL,
  # info about team members...
  # info about perks...
  # info about/meta-analysis of campaign text, images, and videos...
  # info about updates...
  # info about backers...
  crawled_on DATETIME NOT NULL
)

Notably, I'm thinking about isolating the sections that are represented by comments since many of those sections may or may not contain information in various quantities. Also, the fields with VARCHAR(255) may need to be a different datatype.

Upvotes: 0

Views: 177

Answers (2)

Rick James
Rick James

Reputation: 142518

(Partial answer)

Hundreds of columns similar to what you show is likely to be problematic. I recommend you consider splitting in a few ways.

  • "Team members" sounds like a list of people, not a single person. So, that is necessarily a separate table, joined 1:many. Similarly "images" sounds like an open-ended list.
  • Relative static data should (perhaps) be segregated from the frequently-updated data.
  • Figure out what your SELECTs will look like. If some of them look at, say, "product" columns but not "owner" columns, then it will probably be beneficial to split out one clump of columns or the other.

Hundreds, even Millions of rows in a single table is not a problem. Hundreds of columns in a single table is treading on thin ice.

Blindly using (255) is likely to bite you.

If you are crawling different sites, it is extremely likely that the format and composition of the data you get will vary from site to site. (I've done such with news sites -- it is a fulltime job.)

Bottom line: There is no simple, obvious, answer to your Question. You will have challenges.

Upvotes: 1

Boris Schegolev
Boris Schegolev

Reputation: 3701

Stick with a normalized schema. One table for the content you describe is good, unless we are talking about some EXTREME amounts of data. IMHO, in the latter case MySQL would not be a very good choice anyway.

Keep it simple: design one table, choose correct datatypes, avoid NULL-able columns (you mentioned "tens and tens of columns", what for?) and index the data properly. You can't miss.

Upvotes: 0

Related Questions