Reputation: 2067
I'm working on an application that routinely gathers information from a large number of websites and saves it to a mysql database with a table for each site. The idea is to create a sort of customizable news feed.
I want the ability to request any number of news items and filter out certain sites too. As an example:
Show newest 100 items but exclude items from Twitter and GitHub.
It seems like the best way to handle this is to create a table that just has foreign keys and website names.
master_table(id, website, date, foreign_key)
and I can just query the foreign ids I need from this table.
Am I going about this horribly wrong?
Upvotes: 1
Views: 292
Reputation: 37354
Some ideas which you may find useful:
As you said, you can put all common properties in one table :
master_table(id, website, date, foreign_key,url)
Then create another table:
details_tables(id, master_id, property_name, property_value, property_type)
which is used to retrieve data specific to each site. I believe it will give more flexibility than storing everything in json . If you expect many records in these tables, consider using partitioning.
You can also keep the existing table structure, but add a view :
SELECT 'Stack Overflow' as site_name, id, url, date
FROM stackoverflow_table
UNION
SELECT 'Reddit' as site_name, id, url, date
FROM reddit_table
...etc
Upvotes: 1
Reputation: 3826
I've actually been working on a similar site. Not for other sites, but a kind of Facebook-like site for a niche community with newsfeeds from various sources. I've been pondering this question very heavily the past couple of weeks.
One issue, probably not gamebreaking, but still an issue for me, is that since your foreign_key
column isn't literally a foreign key due to referencing multiple tables, so it can't get the benefits from things such as referential integrity enforcement.
What I'm considering is making a GUID table that serves as the source of ids for all of the other tables, and having a table specifically dedicated to the news feed. It might be defined as something like:
CREATE TABLE sources (
id INTEGER AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(20),
url VARCHAR(50) );
INSERT INTO sources (name, url) VALUES ('Stack Overflow', 'http://stackoverflow.com');
INSERT INTO sources (name, url) VALUES ('Reddit', 'http://www.reddit.com');
INSERT INTO sources (name, url) VALUES ('Github', 'http://github.com');
INSERT INTO sources (name, url) VALUES ('Twitter', 'http://twitter.com');
/* And so on... */
CREATE TABLE newsfeed (
guid INTEGER PRIMARY KEY, -- Note: NOT auto_increment!
source_id INTEGER NOT NULL,
timestamp DATETIME NOT NULL,
text VARCHAR(256),
ref_id INTEGER NOT NULL,
FOREIGN KEY (source_id) references sources (id) );
You could still store information about site postings in their own tables, but now you're just referencing the one newsfeed table for what to actually display on the page, with the ref_id being a pointer to individual source tables if someone wants to deep-dive into the information. It's still not ideal because ref_id still isn't a true foreign key, but it's arguably a little better.
You might even want to do something like this instead of ref_id:
data TEXT,
with the contents of that column for any given entry a source-specific data payload. For example, for Github posts, it could contain a JSON string such as:
{"commit_message":"Updated global variable namespace.",
"author":"King Skippus","repository":"Ibuware"}
Upvotes: 1
Reputation: 1981
A table for each site is the issue. As more sites come online you will end up with too many links.
Get a DBA/Architect involved to fix your data model and your links issue will disappear.
Upvotes: -1