Before discussing how to manage fault tolerance, we should take a brief look at the current PIM architecture. Here we have a simplified statechart that outlines the different phases of the PIM's life cycle.
As we can see, the current approach does not consider the possibility of network or hardware failure. After an initial setup phase, each node tries to connect to its next and previous nodes, using the settings loaded from the configuration file. If any of these connections fails, execution cannot continue: the ring is broken and nothing can repair it.
Here is a summary of what we would like to achieve:
- implement a mechanism capable of detecting the arrival of new nodes or the failure of existing ones
- manage the failure of a node which was waiting for the CP
- manage the failure of a node which was executing the CP
- maintain a backup copy of the CP on each node
Our first effort consisted of integrating the GroupManager into the PIM, so that we could manage PublicPeerGroups.
After this phase, we are now modifying the PIM architecture to achieve a fully dynamic connection procedure, so that we no longer have to rely on the configuration file. Our idea is to exploit the GroupManager functionality and create an ordered list of peers. This list is local to each node and is updated on every newPeer and deadPeer event from the GroupManager.
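As an illustration, the local ordered peer list could be maintained as in the following sketch. Only the newPeer/deadPeer events come from the GroupManager; the class name, the handler names, and the choice of the UUID as ordering key are our own assumptions:

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentSkipListSet;

// Sketch: a local, ordered list of peers kept in sync with the
// GroupManager's newPeer/deadPeer events. Ordering by UUID gives every
// node the same view of the ring.
class PeerList {
    private final ConcurrentSkipListSet<UUID> peers = new ConcurrentSkipListSet<>();

    // Called from the GroupManagerListener on a newPeer event.
    public void onNewPeer(UUID id)  { peers.add(id); }

    // Called from the GroupManagerListener on a deadPeer event.
    public void onDeadPeer(UUID id) { peers.remove(id); }

    // The next node in the ring: the smallest UUID greater than ours,
    // wrapping around to the first peer in the list.
    public UUID nextOf(UUID self) {
        UUID next = peers.higher(self);
        return next != null ? next : peers.first();
    }

    public int size() { return peers.size(); }
}
```

Because every node applies the same events to the same ordering rule, the ring topology can be recomputed locally after each membership change, with no configuration file involved.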
Another modification we are working on is a mechanism to identify which node currently holds the most recent backup version of the CP. For now, we would like to include the CP version number in each GroupManager packet that a node sends to the others. This should let every node know which version is stored on each peer, simply by looking at its ordered list of peers.
These modifications (the GroupManager integration, the fully dynamic connect procedure and the CP backup copy versioning) should allow us to reach our objectives, as presented in the following statechart.
The Latest CP Search state runs on each node: it analyzes the CP version numbers recorded in the peer list, keyed by the UUID of each node. To transmit this information along with the GroupManager packets, we are considering exploiting the possibility of changing the peer name. In other words, we would call the GroupManager.setNodeName(String) method every time a node updates its CP version. This way, the GroupManager transmits this extra information without any modification to its code, thus maintaining full compatibility with future GroupManager versions.
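A minimal sketch of this piggy-backing idea follows. Only GroupManager.setNodeName(String) is mentioned in the text; the "name#version" format and the helper names are our own assumptions:

```java
// Sketch: encoding the CP backup version inside the peer name, so that
// the GroupManager carries it unchanged. A node would call
// groupManager.setNodeName(CpVersionName.encode("node-A", 17)) every
// time it stores a new CP backup (name format is an assumption).
class CpVersionName {
    private static final char SEP = '#';

    // Builds the name to pass to GroupManager.setNodeName(String).
    public static String encode(String baseName, long cpVersion) {
        return baseName + SEP + cpVersion;
    }

    // Parses the version back out of a peer name seen in the peer list;
    // returns -1 for names that do not carry a version yet.
    public static long decode(String peerName) {
        int i = peerName.lastIndexOf(SEP);
        if (i < 0) return -1;
        try {
            return Long.parseLong(peerName.substring(i + 1));
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

During the Latest CP Search state, each node would simply decode() every name in its peer list and pick the peer with the highest version.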
As for the notification of new or dead nodes, we have defined a static method inside the PIM_Runtime that is executed by the GroupManagerListener. First of all, this method interrupts the PIM runtime thread, unblocking it from the read on the prevNodeSocket; it then prepares the system recovery. For now, we have chosen the simpler option of letting the reconnection phase involve every node, not only the two neighbors of the new/dead node.
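The interrupt-based hook could look like the sketch below. The class and method names are assumptions, and the blocking socket read is stood in for by a sleep:

```java
// Sketch of the static notification hook: the GroupManagerListener calls
// PimRuntime.onMembershipChange() on every newPeer/deadPeer event, which
// interrupts the runtime thread so it abandons the blocking read on
// prevNodeSocket and enters the recovery phase.
class PimRuntime implements Runnable {
    private static volatile Thread runtimeThread;
    private static volatile boolean recoveryRequested;

    public void run() {
        runtimeThread = Thread.currentThread();
        while (!recoveryRequested) {
            try {
                // In the real PIM this is the blocking read on
                // prevNodeSocket; here we just park until interrupted.
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                // Fall through and re-check recoveryRequested.
            }
        }
        // Prepare system recovery: drop the current ring links and
        // reconnect according to the updated peer list.
    }

    // Executed by the GroupManagerListener on every membership change.
    public static void onMembershipChange() {
        recoveryRequested = true;
        Thread t = runtimeThread;
        if (t != null) t.interrupt();
    }
}
```

The flag is checked after the interrupt so that a spurious wake-up of the read does not trigger an unwanted recovery.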
Some remarks on the current design:
- we use the nodeName (assigned by the GroupManager to each node) to store the version number of the backup CP. We admit this is not a very clean approach, but we chose to leave the GroupManager unchanged in order to maintain compatibility.
- to achieve better responsiveness, we should force the GroupManager to use very short ping intervals. This increases network traffic, but instead of relying on broadcasts we could make the GroupManager ping, very frequently, only the nodes present in the list.
- the fully dynamic connection procedure has a potential drawback: the notion of a node number loses its meaning. In the traditional approach, every node had a node number associated with it, according to the configuration file. Applications like CaptureTheFlag strongly rely on this number to identify the different nodes. In a dynamic environment, where nodes can join or leave, the binding between a node and its position in the list may change quite often. A possible solution is to use the GroupManager's UUID instead of the number taken from the configuration file. Other thoughts on this topic can be found here.
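The last remark can be made concrete with a small example. The helper below (names are our own) computes a node's position in the UUID-ordered peer list; the point is that this position shifts when a peer dies, while the UUID stays stable, so applications should key their state on the UUID:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Illustration: the "node number" a dynamic ring could derive from the
// peer list is just the node's position in UUID order, and it changes
// whenever the membership changes. The UUID itself does not.
class NodePosition {
    static int positionOf(List<UUID> peers, UUID self) {
        List<UUID> sorted = new ArrayList<>(peers);
        Collections.sort(sorted);
        return sorted.indexOf(self);
    }
}
```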
Observations and thoughts
The notification of a dead node cannot be based on the sockets alone: if a node fails abruptly (e.g. a hardware failure or a sudden disconnection from the network), the socket may remain open for a long time. Two other approaches are possible:
- the ACK approach: each node sends an ACK to its previous node after it has successfully forwarded the CP to its next node. However, this approach cannot detect the death of a node that has already sent its ACK and is merely waiting for the CP.
- another approach we are considering is establishing a second connection ring, in the opposite direction, used only for service messages. This solution has the drawback that identifying the most recent backup copy of the CP is not a trivial task.
The fully dynamic connection procedure has some interesting consequences: by exploiting the UUID list, we could dynamically choose not only how many nodes are present in the connection ring, but also how many nodes must be present before the CP starts, as well as the maximum number of members.
Furthermore, we should dynamically adapt the GroupManager's ping interval (and, accordingly, the time after which a silent peer is considered dead). In particular, we could choose this interval based on the measured roundtrip time: longer RTTs would lead to more relaxed intervals, while higher responsiveness is paid for only when it is really needed.
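A possible RTT-adaptive scheme is sketched below. It assumes the GroupManager exposes a way to change the ping interval at runtime (our assumption); the exponential smoothing and the dead-peer timeout expressed as a multiple of the ping interval follow common failure-detector practice:

```java
// Sketch: derive the GroupManager ping interval from a smoothed RTT
// estimate. Short RTT -> short interval (high responsiveness on a fast
// LAN); long RTT -> relaxed interval (less traffic over slow links).
class AdaptivePing {
    private double smoothedRttMs = 100.0;      // initial RTT estimate
    private static final double ALPHA = 0.125; // EWMA weight for new samples
    private static final long MIN_INTERVAL_MS = 50;
    private static final long MAX_INTERVAL_MS = 5_000;

    // Feed every measured roundtrip time into the estimator.
    public void onRttSample(long rttMs) {
        smoothedRttMs = (1 - ALPHA) * smoothedRttMs + ALPHA * rttMs;
    }

    // Interval to hand to the GroupManager, clamped to sane bounds.
    public long pingIntervalMs() {
        long interval = (long) (4 * smoothedRttMs);
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, interval));
    }

    // A peer that stays silent for a few intervals is declared dead.
    public long deadPeerTimeoutMs() {
        return 3 * pingIntervalMs();
    }
}
```

The constants (the 4x RTT factor, the 3-interval timeout, and the clamping bounds) are illustrative and would need tuning against real network conditions.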