[Thali-talk] How to handle disconnect for stories -1 and 0

Yaron Goland yarong at microsoft.com
Thu Jun 25 22:27:30 EDT 2015


Warning: Still on the public alias. :)

Executive Summary: Oguz's suggestion to make the native code brain dead simple and handle muxing at the node.js layer worked out beautifully when Matthew pointed me to a "Brilliant!" library<https://github.com/maxogden/multiplex> that can handle muxing and demuxing many TCP/IP connections down to one TCP/IP connection. I wrote sample code<https://github.com/thaliproject/Thali_CordovaPlugin/blob/345803b3f059ee21b5ad9ac7122cb04f32ef00fc/test/sockettest.js#L306> to test this on Android and it worked really well using loopback. I still need to test error handling and use it across Jukka's Bluetooth code but it's looking good. So this strongly argues that we should get rid of disconnect altogether and the native code should have the logic that when connect is called it creates a P2P connection and holds it. Only once a TCP/IP connection is made locally and then cut should the P2P connection be removed.

Long Winded Version:

So per our conversation today I looked into the whole issue of disconnect and how to deal with the framing problem.

For those following along at home, the framing issue is something I referred to only obliquely in the email this is a response to, but it is really important because it will drive everything in this email.

Imagine the following setup:

Phone 1: HTTP Client --> TCP/IP Client Bridge -->
P2P Connection -->
Phone 2: TCP/IP Server Bridge -->HTTP Server

So phone 1 has an HTTP client that is talking over the local P2P Connection (Bluetooth for Android, for example) to phone 2's HTTP server.

Now imagine the following time line:

Time 0 - The HTTP Client sends an HTTP request over its local TCP/IP link to its local TCP/IP Client Bridge.

Time 1 - The TCP/IP Client Bridge sends the request over the local P2P Connection to the remote machine's TCP/IP Server Bridge who then forwards the request to the remote HTTP Server.

Time 2 - The HTTP Client terminates its connection to the local TCP/IP Client Bridge. This could be due to a spurious error, for example.

Time 3 - The HTTP Client re-connects its TCP/IP connection to the local TCP/IP Client Bridge.

Time 4 - The HTTP server finally gets around to processing the HTTP request sent at time 0 and sends a response.

Time 5 - The HTTP Client, which hasn't sent any requests over its new TCP/IP connection to the local TCP/IP Client Bridge, suddenly receives an HTTP response!!

This is obviously broken. The problem here is that the HTTP client thinks it has a brand new connection and any old state is gone. The HTTP Server, on the other hand, had no idea that the HTTP Client had created a new connection and thinks the old connection still exists. The results are.... Bad.

So we have to avoid this.

For story -1 I proposed avoiding this in a very expensive way. Whenever the TCP/IP connection to the TCP/IP Client Bridge is lost then the TCP/IP Client Bridge would immediately tear down the P2P connection.

The good news is that this would definitively tell the TCP/IP Server Bridge that all existing connections are dead and would thus signal the TCP/IP Server Bridge to terminate any connections it had to the HTTP Server. Now everyone is very clear that if a new connection is created then we are starting anew.

The bad news is that the time to establish a Bluetooth connection is anywhere from 2 to 3 seconds. Given the limited time windows we have to do replication, that is forever!

Oguz made a good suggestion. He said that what we should do is introduce a framing layer that lives in Node.js and leave the native connections dumb. It turns out his suggestion should work incredibly well. It's just that explaining how is a bit complicated.

So Jukka, Um... please.... Bear with me? I know this is painful.

The new architecture would look like:

Phone 1:
Node.js - [HTTP Client --> TCP/IP Client Mux Bridge -->]
Native - [TCP/IP Client Bridge -->]

P2P Connection -->

Phone 2:
Native - [TCP/IP Server Bridge -->]
Node.js - [TCP/IP Server Mux Bridge --> HTTP Server]

Now, for Jukka and David, the good news is that this design makes their lives much easier. All they have to implement is the TCP/IP Client and Server Bridges. These things are dumb as a post! Their logic is as I described in the -1 story. That is:

TCP/IP Client Bridge - Created with a call to connect(peerId, callback). This creates a new instance of the TCP/IP Client Bridge and causes a P2P connection to be made to the identified peer ID. The callback will return a port. The port will accept exactly one connection. Attempts to make more than one connection must be rejected. If, FOR ANY REASON, the established TCP/IP connection to that port is torn down then the P2P connection must be torn down. If the P2P connection is lost then the TCP/IP connection must also be killed. This means there is no disconnect method. Just connect. When the client is done, they just kill the TCP/IP connection.

TCP/IP Server Bridge - This is created as a side effect of StartBroadcasting. Whenever a connection is received from the P2P transport this will cause the TCP/IP Server Bridge to establish a TCP/IP client connection to the TCP/IP Server Mux Bridge whose address will be passed in as the portNumber argument of StartBroadcasting. So when device 1 connects to device 2 then on device 2 there will be a single TCP/IP connection to portNumber. If device 3 connects to device 2 then there will be another TCP/IP connection to portNumber. Each device gets 1 and only 1 TCP/IP connection to portNumber. If the P2P connection to a particular device is lost then its associated TCP/IP connection must be torn down.

That's it. That's all the native code has to do.

Now the obvious question is - what about the TCP/IP Client Mux Bridge and the TCP/IP Server Mux Bridge? How do they work?

The answer is that the TCP/IP Client Mux Bridge sets up a local host listening endpoint that accepts as many TCP/IP connections as the local clients want to send. It will then mux all these TCP/IP connections onto a single TCP/IP connection which it will then send down the pipe to the TCP/IP Client Bridge. The TCP/IP Server Mux Bridge does the inverse. It takes the single TCP/IP connection per device output by the TCP/IP Server Bridge and de-muxes it into individual TCP/IP connections.
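To make the muxing idea concrete, here is a toy length-prefixed frame format in Node.js. This is NOT the wire format the multiplex library actually uses; it's just a sketch of the underlying concept of carrying several logical streams over one connection:

```javascript
// Toy frame format (NOT the real multiplex wire format): each frame is
// [1 byte stream id][4 byte big-endian payload length][payload]. Tagging
// every chunk with a stream id is what lets many TCP/IP connections share
// a single underlying connection and be split apart again on the far side.
function encodeFrame(streamId, payload) {
  const header = Buffer.alloc(5);
  header.writeUInt8(streamId, 0);
  header.writeUInt32BE(payload.length, 1);
  return Buffer.concat([header, payload]);
}

function decodeFrames(buffer) {
  const frames = [];
  let offset = 0;
  while (offset + 5 <= buffer.length) {
    const streamId = buffer.readUInt8(offset);
    const length = buffer.readUInt32BE(offset + 1);
    if (offset + 5 + length > buffer.length) break; // partial frame, wait for more bytes
    frames.push({ streamId, payload: buffer.subarray(offset + 5, offset + 5 + length) });
    offset += 5 + length;
  }
  return frames;
}
```

The real library adds stream lifecycle signaling (open/close per stream id) on top of this basic idea, which is exactly what we need to tell the far side "this logical connection is dead" without touching the underlying P2P link.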

Originally I was going to make this all work using Web Sockets or maybe HTTP/2.0. But Matt, bless his heart, saved me from my own addiction to complexity by pointing me to https://github.com/maxogden/multiplex. And, as Jukka would say, IT'S BRILLIANT! Or at least it has been so far in my rather simple testing. You can see my work in story-1-yarong, specifically in sockettest.js at this line<https://github.com/thaliproject/Thali_CordovaPlugin/blob/345803b3f059ee21b5ad9ac7122cb04f32ef00fc/test/sockettest.js#L306>.

This test does a number of really interesting things. The first interesting thing it does is wait. See that 1 second timeout? That timeout occurs after the muxClientBridge forms a TCP/IP connection to the muxServerBridge. I ran this code on Android using JXCore and it ran fine. Meaning that the connection did not time out!! This was a concern because of Node's aggressive behavior in cleaning up TCP/IP connections but apparently this only applies to HTTP connections, not sockets.

Second, it runs three different TCP/IP clients in parallel. Each of those clients creates a separate connection to the muxClientBridge. The muxClientBridge then muxes all those multiple TCP/IP connections down to one single TCP/IP connection which it then connects to the muxServerBridge. The muxServerBridge then takes the TCP/IP connections it receives via the mux connection and de-muxes them to make three separate TCP/IP connections to the local TCP/IP server.

And this all worked on Android!!!!!

I still need to hook it up to Jukka's code just to see it work across Bluetooth but so far it's looking really good. I still have to do more testing to make sure I can reliably detect things like streams closing. But it does seem to work so I think we should go with the simpler proposal for the native code assuming I successfully get this working on Bluetooth.

               Thanks,

                              Yaron

From: Yaron Goland
Sent: Wednesday, June 24, 2015 9:55 PM
To: Jukka Silvennoinen; Matthew Podwysocki; David Douglas
Cc: thali-talk at thaliproject.org
Subject: How to handle disconnect for stories -1 and 0


Warning: This is to the public alias. Please respond here but just keep that in mind.



The good news is that I actually got Story -1 to pass on Android, more or less. I need to tweak it a bit but the fundamentals did work from my Samsung to my Nexus. I am having a problem with the Nexus talking to the Samsung (discovery works but when the Nexus tries to connect it gets an I/O error) but given how old the Samsung S4 running Jelly Bean is, I don't care all that much. I'll be getting two new phones soon.



That having been said, playing around with the code made me realize that I was wrong in my proposal to get rid of disconnect. The reason is that without it we are going to have all sorts of nasty race conditions and make life harder for our customers.



I think Jukka had the right logic all along.



The requirements would be:



The TCP client bridge defines a state model with the following states:

Bound

Unbound



When the connect function is called with a peer ID and a callback then a TCP client bridge instance MUST be created and bound to a local TCP/IP port. From the moment the local TCP/IP port is bound the TCP client bridge associated with the submitted peer ID is in the bound state.



When the TCP client bridge enters the bound state it MUST call the associated connect callback with the local port it is bound to.



The TCP client bridge MUST form a local P2P connection either when it enters the bound state or when there is a connection to its local TCP/IP port. The exact choice is implementation dependent and based on how long it takes to establish a local P2P connection.



If the local P2P connection is lost while there are connections to the local TCP/IP port then the TCP client bridge MUST throw an exception on all open sockets and dispose of them.



If the TCP client bridge loses the local P2P connection for any reason while in the bound state then it MUST either immediately reconnect over the local P2P transport to the associated peer ID or wait until there is a new connection to the TCP client bridge's local TCP/IP port (since, per the previous requirement, any existing TCP/IP connections would have been terminated with an exception).



So long as the TCP client bridge is in the bound state it MUST maintain control over the port it returned in the connect callback.



The TCP client bridge MAY restrict the number of incoming TCP/IP connections to the local port to 1 or more. If attempts are made to create more connections than can be supported by the local P2P connection then the additional TCP/IP connection requests MUST be rejected.



Note that the previous requirement effectively mandates the use of a semaphore or equivalent facility in the native code that can guarantee that at any instant no more than the maximum allowed number of connections exist.



Open Issue: Although it's theoretically possible to run multiple simultaneous named streams using the iOS multi-peer framework, I'm not convinced that is worth worrying about right now. As such, unless we really need it, it seems to me that we should limit iOS to 1 connection total, the same way we are forced to limit Android (unless we want to invent our own mux layer). If, however, it turns out we need more connections then we will need a local function that the Node code can call to find out how many connections will be available. This is necessary for setting up the node.js level connection pools.



If a TCP/IP connection to the TCP client bridge local port is lost for any reason then future connections MUST be treated as completely brand new. In practice this means that if the local TCP/IP connection is lost then either the local P2P connection has to be torn down (to make sure all associated state is lost) or some kind of explicit clear signal has to be sent to make sure the remote server understands that the coming bytes form a new connection. The point of this is to prevent a situation where a client has a TCP/IP connection to the TCP client bridge, sends some bytes, loses the connection for some reason, reconnects, and the server on the other side has no clue any of this happened and thinks that the client is just continuing the previous connection. This kind of behavior will lead to data corruption.



If the connect function is called on a peer ID whose TCP client bridge is in the bound state then the callback MUST return an error.



If the disconnect function is called with a peer ID and a callback on a peer ID that is not associated with a TCP client bridge then the callback MUST return success.



If the disconnect function is called with a peer ID and a callback on a peer ID that is associated with a TCP client bridge in the bound state then the TCP client bridge MUST move to the unbound state, reject all future connections to the local TCP/IP port, close all existing connections to the local TCP/IP port (throwing an exception on each) and only once this is all done call the callback with success. Once the callback has been successfully delivered to JXCore the TCP client bridge MAY release its local TCP/IP port. It is considered best practice to hold onto the local TCP/IP port for a minute or two just to guarantee that there will be no confusion about the peer ID the port is associated with.



If the disconnect function is called with a peer ID and a callback on a peer ID that is associated with a TCP client bridge in the unbound state then the callback MUST be immediately executed with a success result.
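The connect/disconnect requirements above boil down to a small state machine. Here is a hypothetical sketch of just the bound/unbound bookkeeping (the class and its names are illustrative only; the real logic lives in native code, and the port here is a stand-in for the real locally bound TCP/IP port):

```javascript
// Hypothetical sketch of the bound/unbound state model described above.
// connect() on a bound peer errors; disconnect() succeeds whether or not
// the peer is bound (or even known), per the requirements.
class TcpClientBridgeState {
  constructor() {
    this.bridges = new Map(); // peerId -> 'bound' | 'unbound'
  }
  connect(peerId, callback) {
    if (this.bridges.get(peerId) === 'bound') {
      // connect on an already-bound peer ID MUST return an error
      return callback(new Error('peer is already bound'));
    }
    this.bridges.set(peerId, 'bound');
    const fakePort = 12345; // stand-in for the real locally bound port
    callback(null, fakePort);
  }
  disconnect(peerId, callback) {
    // unknown or already-unbound peer IDs MUST still report success
    this.bridges.set(peerId, 'unbound');
    callback(null);
  }
}
```

The real implementation would also close the listening port and tear down existing sockets on disconnect before invoking the callback; this sketch only captures the state transitions.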



Note, if necessary, we can do a two stage commit where the node code calls disconnect, the native code calls the callback and then the node code calls a native function confirming the receipt of the callback at the application level. But I'm hoping this isn't necessary and so not specifying it for now. It's easy to add later if we need it.



It's worth pointing out that if we agree with this proposal then I will use these requirements to validate the implementation and, as necessary, write tests to make sure we have the right behavior. This set of interfaces is at the very root of our local P2P stack. It needs to be rock solid. So there will need to be lots of tests to be sure.



               Thanks,



                                             Yaron