[Thali-talk] Improving attachment perf for Thali

Yaron Goland yarong at microsoft.com
Mon May 4 14:11:50 EDT 2015


I had a great chat with Nolan Lawson from PouchDB over email about attachment perf issues in PouchDB. I'm now going to try to summarize that conversation. I'm including Nolan so he can throw things at me if I'm wrong.


Nolan has been suggesting for a while that maybe we just shouldn't use PouchDB for attachments. Instead he suggested something more like NPM, which keeps its JSON data in CouchDB and stores pointers to another place where the attachments live. The benefit of this approach is that one can use mechanisms like rsync to handle synch'ing the binaries.


The problem with this approach is that it makes life very complicated for devs. Devs have to put JSON in one place, use a different mechanism to store attachments, and then make sure the two stay in sync, e.g. no dangling pointers. Also, when it comes time to synch with a remote location, one has to use two different synch protocols: CouchDB for the JSON data and rsync (or whatever) for the binaries.
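
To make that consistency burden concrete, here is a minimal sketch of the check a developer would have to run in the split design. The `attachment_refs` field and the flat collection of blob keys are hypothetical names chosen for illustration; they are not taken from NPM's or Thali's actual layout.

```python
def find_inconsistencies(docs, blob_keys):
    """Compare JSON docs against a separate blob store.

    docs      -- iterable of dicts; each may carry a hypothetical
                 "attachment_refs" list of blob keys
    blob_keys -- iterable of keys that actually exist in the blob store

    Returns (dangling, orphaned):
      dangling -- keys referenced by some doc but missing from the store
      orphaned -- keys in the store that no doc references
    """
    referenced = {ref for doc in docs for ref in doc.get("attachment_refs", [])}
    stored = set(blob_keys)
    return referenced - stored, stored - referenced
```

Every app built on the split design needs some version of this bookkeeping, on both sides of every synch, which is the work CouchDB-style attachments already do for us.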


Since Thali's goal is to provide an easy developer experience, we would inevitably end up wrapping this architecture in our own API, which would probably look just like PouchDB's attachment API, so that a Thali developer just sees JSON and attachments and none of the complexity underneath.


But of course the complexity is still there, and it means a lot of work for us to basically replicate what CouchDB already does, which is to keep attachments and docs consistent.

So the net/net in my mind is that the best approach is to just fix PouchDB rather than introduce a completely new synch protocol in the form of rsync or whatever.

"Fixing" PouchDB attachment handling does require a few steps, though. At a minimum we need to add atts_since support, which lets us say "only send me an attachment if it has changed since the revisions I already have", and we also need MIME multipart support (so we don't have to serialize attachments to base64 strings inside the JSON).

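As a rough illustration of both points, the sketch below builds a CouchDB-style document request that passes `atts_since` as a JSON array of revisions the client already holds, and measures how much an attachment grows when it is base64-encoded inline instead of being sent as a binary multipart part. The host, database, document id, and revision string are made up for the example, and the helper names are hypothetical.

```python
import base64
import json
from urllib.parse import quote, urlencode


def doc_url_with_atts_since(base_url, db, doc_id, known_revs):
    """Build a GET URL asking for a doc plus only those attachments
    that changed after the revisions in known_revs."""
    query = urlencode({
        "attachments": "true",
        # atts_since is sent as a JSON-encoded array of revision ids
        "atts_since": json.dumps(known_revs),
    })
    return f"{base_url}/{quote(db, safe='')}/{quote(doc_id, safe='')}?{query}"


def inline_size(raw):
    """Size in bytes of an attachment once base64-encoded for inline JSON."""
    return len(base64.b64encode(raw))
```

Base64 turns every 3 bytes into 4 characters, so inline attachments cost roughly a third more on the wire before any JSON parsing overhead, which is the cost multipart avoids.
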
There are other changes we can also make but we should drive those based on specific perf tests. For example:

1 - We could put in delta-synch support, so that if we find we are making lots of small changes to attachments we send the delta rather than the whole attachment. This scenario would show up if we find ourselves storing things like word processing or spreadsheet files. There is actually an RFC for delta encoding over HTTP, RFC 3229.

2 - If we have a lot of identical attachments (i.e. ones with the same hash), we could teach the synch engine to recognize incoming docs whose attachment hashes we already have, so it can just link to the existing content without downloading it again.
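
The second idea can be sketched as a content-addressed lookup. This is only an illustration under stated assumptions: the `AttachmentStore` class and `digests_to_fetch` helper are hypothetical, and the digest format ("md5-" followed by the base64 of the MD5 hash) is assumed to match what CouchDB-style attachment stubs carry in their `digest` field.

```python
import base64
import hashlib


def digest_of(data):
    """Digest string in the assumed 'md5-<base64>' stub format."""
    return "md5-" + base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


class AttachmentStore:
    """Hypothetical local store that keeps one copy of each distinct blob."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        digest = digest_of(data)
        self._blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def has(self, digest):
        return digest in self._blobs


def digests_to_fetch(doc, store):
    """Return the digests of attachments in an incoming doc that the
    local store does not hold yet; the rest can simply be linked."""
    stubs = doc.get("_attachments", {})
    return sorted({s["digest"] for s in stubs.values() if not store.has(s["digest"])})
```

Whether this pays off depends on how often identical attachments actually show up, which is why it should be driven by perf tests first.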

There are other changes but those are even more speculative.

So our general approach is going to be to focus on improving PouchDB's attachment handling over trying to introduce another mechanism.

Thoughts?

        Yaron
