Reputation: 3096
I use CouchDB to store crawled websites. For example:
{
"_id": "doc-http:80-example.com/2012/09/",
"_rev": "2-532ce885cdb56261cb6d21903cd74c56",
"contentType": "text/html; charset=UTF-8",
"lastModified": "2013-11-22T17:41:33.471Z",
"schema": "document",
"hostname": "example.com",
"uri": "/2012/09/",
"port": 80,
"protocol": "http:",
"source": [
"http://example.com/page/1",
"http://example.com/page/2"
],
"_attachments": {
"content": {
}
}
}
The "source" element is an array that stores all pages linking to that particular page. The array can grow very quickly, and I don't want to GET and PUT the whole document every time I want to add only one link.
Is it possible to update the document and insert another link into "source" without re-sending the whole "source" array?
Upvotes: 0
Views: 94
Reputation: 8388
Have you looked at update handlers? http://wiki.apache.org/couchdb/Document_Update_Handlers
I haven't used them myself, but from what I've read you should be able to use one to patch documents.
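A minimal sketch of what such an update handler could look like, assuming the new link is sent as the raw request body. The name addSource is hypothetical, and in a real design document the function would be stored as an anonymous function string under "updates". It is written as a named function here only so it can be run locally.

```javascript
// Hypothetical update handler: appends one link to doc.source.
// CouchDB calls update functions with (doc, req) and expects
// [docToSave, response] back; returning null as the first element
// means nothing is written.
function addSource(doc, req) {
  if (!doc) {
    // This sketch does not create missing documents.
    return [null, { code: 404, body: "document not found" }];
  }
  // Assumption: the client sends the new link as the plain request body.
  var link = typeof req.body === "string" ? req.body.trim() : "";
  if (link === "") {
    return [null, { code: 400, body: "missing link" }];
  }
  if (!Array.isArray(doc.source)) {
    doc.source = [];
  }
  if (doc.source.indexOf(link) !== -1) {
    // Already recorded, so skip the write.
    return [null, "already present"];
  }
  doc.source.push(link);
  return [doc, "added"];
}
```

If this lived in a design document named, say, _design/crawler (also a made-up name), it would be invoked with a PUT to /db/_design/crawler/_update/addSource/<docid>, with the document ID URL-encoded since the IDs here contain slashes. The client then sends only the one link, although the server still loads and rewrites the full document as a new revision.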
Upvotes: 3
Reputation: 1836
A further option is to use one document per source and destination URL, rather than one document per destination URL with a long list of sources.
{
...
"sourceUrl": "https://example.com/page/1",
"targetUrl": "https://target.com/page"
}
You would then use a view to get the list of all the source URLs that are pointing to a given target URL:
function(doc) {
emit(doc.targetUrl, doc.sourceUrl);
}
You could also use a _count reduce to quickly retrieve the number of inbound links to a target page, pre-calculating this for display in your UI.
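As a rough illustration, the map function and the built-in _count reduce could sit together in one design document. The names _design/links and by_target are made up for this sketch, and countByKey is only a local stand-in for what the view would report per key when queried with group=true. It is not part of CouchDB.

```javascript
// Hypothetical design document; CouchDB stores view functions as strings.
const designDoc = {
  _id: "_design/links",
  views: {
    by_target: {
      // Guard so documents without both URLs are skipped.
      map: "function (doc) { if (doc.sourceUrl && doc.targetUrl) { emit(doc.targetUrl, doc.sourceUrl); } }",
      reduce: "_count" // built-in reduce: counts emitted rows
    }
  }
};

// Local stand-in for map + _count grouped by key, for trying the idea
// without a running server.
function countByKey(docs, mapSource) {
  const counts = {};
  function emit(key, value) {
    counts[key] = (counts[key] || 0) + 1;
  }
  // Turn the stored string back into a function that can see emit().
  const map = new Function("emit", "return " + mapSource)(emit);
  docs.forEach(function (doc) { map(doc); });
  return counts;
}

const sampleDocs = [
  { sourceUrl: "https://example.com/page/1", targetUrl: "https://target.com/page" },
  { sourceUrl: "https://example.com/page/2", targetUrl: "https://target.com/page" },
  { schema: "document" } // no link fields, so the map function ignores it
];
const inbound = countByKey(sampleDocs, designDoc.views.by_target.map);
```

With a reduce defined, querying the view with reduce=false returns the individual source URLs again instead of the count.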
Furthermore, a second view that calls emit(doc.sourceUrl, doc.targetUrl); would be easy to query for the outbound links from a given source.
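Querying either view for a single URL comes down to passing that URL as the key parameter, which has to be JSON-encoded and then URL-encoded. A small sketch, where viewUrl, the database name crawl, and the view names are all assumptions for illustration:

```javascript
// Hypothetical helper: builds the URL for looking up one key in a view.
function viewUrl(base, db, designName, viewName, key) {
  return base + "/" + encodeURIComponent(db) +
    "/_design/" + encodeURIComponent(designName) +
    "/_view/" + encodeURIComponent(viewName) +
    // CouchDB expects key as JSON, so a string key keeps its quotes.
    "?key=" + encodeURIComponent(JSON.stringify(key));
}

// Outbound links from one source page (assumed view name "by_source").
const outboundUrl = viewUrl(
  "http://localhost:5984", "crawl", "links", "by_source",
  "https://example.com/page/1"
);
```

A GET on the resulting URL should return one row per link, with the target URL as the row value.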
Upvotes: 1