Lukasz Kujawa

Reputation: 3096

Managing a big array in CouchDB

I use CouchDB to store crawled websites. For example:

{
   "_id": "doc-http:80-example.com/2012/09/",
   "_rev": "2-532ce885cdb56261cb6d21903cd74c56",
   "contentType": "text/html; charset=UTF-8",
   "lastModified": "2013-11-22T17:41:33.471Z",
   "schema": "document",
   "hostname": "example.com",
   "uri": "/2012/09/",
   "port": 80,
   "protocol": "http:",
   "source": [
       "http://example.com/page/1",
       "http://example.com/page/2"
   ],
   "_attachments": {
       "content": {
       }
   }
}

The "source" element is an array of all the pages that link to this particular page. The array can grow very quickly, and I don't want to GET and PUT the whole document every time I add a single link.

Is it possible to update the document and append another link to "source" without re-sending the whole array?

Upvotes: 0

Views: 94

Answers (2)

Daniel

Reputation: 8388

Have you looked at update handlers? http://wiki.apache.org/couchdb/Document_Update_Handlers

I haven't done it myself, but from what I've read you should be able to use them to patch documents.
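
As a rough, untested sketch of what such a handler might look like: the name addSource and the choice to send the bare URL as the request body are my assumptions, not something from the question. The function would be stored as a string under "updates" in a design document and called with PUT /{db}/_design/{ddoc}/_update/addSource/{docid}.

```javascript
// Hypothetical update handler: appends one URL to doc.source.
// Assumption: the client sends the new link as the raw request body.
function addSource(doc, req) {
    // No document with that ID: save nothing and report the problem.
    if (!doc) {
        return [null, { code: 404, body: "document not found" }];
    }
    var link = (req.body || "").trim();
    // An empty body may arrive as the string "undefined", so treat it as missing.
    if (!link || link === "undefined") {
        return [null, { code: 400, body: "missing link" }];
    }
    doc.source = doc.source || [];
    // Returning null as the first element leaves the stored document untouched.
    if (doc.source.indexOf(link) !== -1) {
        return [null, "unchanged"];
    }
    doc.source.push(link);
    // Returning the document tells CouchDB to save it as a new revision.
    return [doc, "added"];
}
```

This only saves the round trip from the client. The server still writes the whole document as a new revision, so a very large "source" array stays expensive to store.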

Upvotes: 3

Mike Rhodes

Reputation: 1836

A further option is to use one document per source and destination URL, rather than one document per destination URL with a long list of sources.

{
    ...
    "sourceUrl": "https://example.com/page/1",
    "targetUrl": "https://target.com/page"
}

You would then use a view to get the list of all the source URLs that are pointing to a given target URL:

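function(doc) {
    // Index only link documents, keyed by the page being linked to.
    if (doc.sourceUrl && doc.targetUrl) {
        emit(doc.targetUrl, doc.sourceUrl);
    }
}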

You could also add a _count reduce to this view to quickly retrieve the number of inbound links to a target page, pre-calculating it for display in your UI.

Furthermore, emit(doc.sourceUrl, doc.targetUrl); would give you a view easily queryable for the links outwards from a given source.
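
Put together, the two views might sit in one design document, roughly as sketched here. The names links, by_target and by_source, and the database name crawl in the usage note, are made up for illustration. The helper only builds the query path; the key has to be JSON-encoded in the query string.

```javascript
// Hypothetical design document holding both views (all names are illustrative).
var designDoc = {
    _id: "_design/links",
    views: {
        // Inbound links: target URL -> source URL, with a built-in count reduce.
        by_target: {
            map: "function(doc) { if (doc.sourceUrl && doc.targetUrl) { emit(doc.targetUrl, doc.sourceUrl); } }",
            reduce: "_count"
        },
        // Outbound links: source URL -> target URL.
        by_source: {
            map: "function(doc) { if (doc.sourceUrl && doc.targetUrl) { emit(doc.sourceUrl, doc.targetUrl); } }"
        }
    }
};

// Build the path for querying one view of the design document by a single key.
function viewPath(db, view, key) {
    return "/" + encodeURIComponent(db) + "/" + designDoc._id + "/_view/" + view +
        "?key=" + encodeURIComponent(JSON.stringify(key));
}
```

For example, a GET on viewPath("crawl", "by_target", "https://target.com/page") should return the inbound link count for that page, and appending &reduce=false should return the individual source URLs instead.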

Upvotes: 1
