The 1PercentPool guide to a worry-free stake pool operation

For the benefit of the Cardano community, we’ll share our experience of running 1PercentPool. We ran into several issues along the way, and we’ve worked out methods to prevent them or to mitigate them when they come up.

Once Jormungandr is fully developed and optimized, these methods should no longer be needed. They are in place to compensate for issues in the current beta versions.

Current issues

  1. When a node is stuck or running behind, you have to restart it. After a restart, the node has to bootstrap from one of the 14 nodes currently provided by IOHK before it can continue operating. (This was a simple way for IOHK to get the testnet launched; it will definitely be fixed before mainnet, so the network does not rely on 14 nodes to be operational.) All Daedalus wallets currently connect to 7 of these 14 nodes at launch, which puts a huge load on them. The nodes get overloaded and some get stuck, which is why so many people have trouble getting the wallet up and running. It can take hours to get your stake pool restarted when it’s stuck or behind, which causes pools to miss their blocks.
  2. A node can fall behind the tip for many reasons. The node is supposed to recover by itself, but sometimes it’s just permanently a few blocks behind, or it will get completely stuck. With just 1 node running, it’s hard to determine if the node is behind or not, and when it needs to be restarted.
  3. Parts of the node can crash and be unable to recover and keep up with the chain. When this happens, the node needs to get restarted in order to keep up and produce blocks again.
  4. For blocks to be included in the blockchain, they need to be propagated through the network before the next block is created. You can have a huge number of connections, but this increases the load on your CPU and network bandwidth. Imagine trying to upload a file to 2000 servers at the same time: each connection gets only 1/2000th of the available bandwidth, and it takes much longer before all 2000 nodes have received the block.
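
The bandwidth arithmetic in item 4 can be sketched in a few lines of Python. The block size and uplink speed below are made-up illustration values, not measurements, and the model assumes the uplink is split evenly between connections with no protocol overhead.

```python
def seconds_to_send_block(block_bytes, uplink_bits_per_s, connections):
    """Time until every peer has the block when the uplink is shared evenly."""
    if connections < 1:
        raise ValueError("connections must be at least 1")
    per_connection_bits_per_s = uplink_bits_per_s / connections
    return block_bytes * 8 / per_connection_bits_per_s

BLOCK_BYTES = 100_000      # hypothetical 100 kB block
UPLINK = 100_000_000       # hypothetical 100 Mbit/s uplink

for n in (20, 200, 2000):
    seconds = seconds_to_send_block(BLOCK_BYTES, UPLINK, n)
    print(n, "connections:", round(seconds, 2), "s")
```

Under these assumed numbers the send time grows linearly with the number of connections, which is the effect described above.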

Mitigation methods

The best way to prevent one operator from gaining control of more than half the stake in the network (a 51% attack) is for everyone to run their own pools! Once the node is fully optimized, it will be able to run on very cheap hardware, and hence it will be cheaper for people to run their own set-and-forget pool than to pay 1% of their rewards to a stake pool.

Private bootstrap cluster

There is no technical reason why you have to use the 14 IOHK nodes to bootstrap your node. You can bootstrap from any node! Such a node is called a trusted node, and it must have a public_id set in its config file. Other nodes can then add it to their own config files using its IP address and public_id.

What you want to do is have your own private bootstrap cluster. This enables you to restart your node and be up to speed in less than two minutes.

‘Why not just a single node to boot from?’ you may ask. Fair question. A single bootstrap node can fall behind just like the pool. We need multiple interconnected nodes running in a small cluster, in order to better keep up with the blockchain and know when we’re behind.

You create this cluster by setting up 5 nodes, each with a random, unique public_id. Each node should list the other four as trusted nodes in its config file. This makes the nodes in the cluster connect to each other. All other settings apart from the trusted nodes should be left at their defaults.
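
As a rough illustration of the cluster layout, here is a minimal Python sketch that generates five random ids and gives each node the other four as trusted nodes. The key names (p2p, public_address, public_id, trusted_peers, address, id), the 24-byte hex id format and the 10.0.0.x addresses are assumptions for illustration; check them against the config format of the Jormungandr version you run.

```python
import json
import secrets

def build_cluster_configs(addresses):
    """Return one config dict per node; each trusts every other cluster node."""
    # Assumption: a public_id is 24 random bytes written as 48 hex characters.
    nodes = [{"address": addr, "id": secrets.token_hex(24)} for addr in addresses]
    configs = []
    for node in nodes:
        # Every cluster node except this one becomes a trusted peer.
        peers = [dict(p) for p in nodes if p["id"] != node["id"]]
        configs.append({
            "p2p": {
                "public_address": node["address"],
                "public_id": node["id"],
                "trusted_peers": peers,
            }
        })
    return configs

# Hypothetical private addresses for the five cluster nodes.
ADDRESSES = [f"/ip4/10.0.0.{i}/tcp/3000" for i in range(1, 6)]

for config in build_cluster_configs(ADDRESSES):
    print(json.dumps(config, indent=2))
```

Each printed fragment would still have to be merged by hand into the node's real config file, next to the default settings.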

The next step is to get the cluster up and running. You start out by having all the default trusted nodes in the config file along with the new cluster ones. You start all five cluster nodes, and wait for at least one of them to be up and running. You then shut down the 4 nodes that haven’t finished bootstrapping yet, and remove the default trusted nodes from their config files so only the cluster trusted nodes remain. You then start the 4 nodes again, and watch them bootstrap from your cluster node. This takes 2-10 minutes, depending on your hardware. When this is done, you shut down the original node, remove the default trusted nodes from its config file and start it again. You now have your own private bootstrap cluster! The next section explains further why 5 nodes are needed.
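
The config edit in this procedure (dropping the default trusted nodes and keeping only the cluster ones) could be scripted roughly as follows. This is a minimal Python sketch that treats the config as a dictionary; the p2p, trusted_peers and id keys, the addresses and the ids are placeholders for illustration and should be checked against your own config files.

```python
import copy

def keep_only_cluster_peers(config, cluster_ids):
    """Return a copy of config whose trusted peers are limited to the cluster."""
    trimmed = copy.deepcopy(config)          # leave the caller's config untouched
    p2p = trimmed.setdefault("p2p", {})
    p2p["trusted_peers"] = [
        peer for peer in p2p.get("trusted_peers", [])
        if peer.get("id") in cluster_ids
    ]
    return trimmed

# Hypothetical config: one default trusted node and two cluster nodes.
config = {
    "p2p": {
        "trusted_peers": [
            {"address": "/ip4/203.0.113.10/tcp/3000", "id": "default-node-id"},
            {"address": "/ip4/10.0.0.2/tcp/3000", "id": "cluster-id-2"},
            {"address": "/ip4/10.0.0.3/tcp/3000", "id": "cluster-id-3"},
        ]
    }
}

trimmed = keep_only_cluster_peers(config, {"cluster-id-2", "cluster-id-3"})
print([peer["id"] for peer in trimmed["p2p"]["trusted_peers"]])
```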

Restart node when it’s behind

How do you know when your pool is behind the tip of the blockchain? You find out by running a private bootstrap cluster. By monitoring the tip of the cluster nodes and the pool node, you can determine whether one or more of the nodes are running behind or stuck in a fork. The method is not completely fail-safe, but it has proven very reliable over many epochs.

The method is to create a script that checks the tips of the nodes every few seconds and finds the maximum tip number. We then compare every node against this maximum to see if any node is behind it. This naturally happens for a second or so from time to time, so a single lagging check is not enough reason to restart. Each node should be allowed to be behind the tip for 3 consecutive checks, and then be restarted if it’s still behind. Once it catches up, its counter should be reset.
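
The counting rule can be sketched as a small Python function. This is an illustration under stated assumptions, not the 1PercentPool script: the node names are made up, how the tips are fetched is left out, and treating a node that does not answer as "behind" is our own choice here.

```python
def nodes_to_restart(tips, behind_counts, max_checks_behind=3):
    """Update behind_counts in place and return the nodes that need a restart.

    tips maps node name -> latest block height, or None if the node gave no answer.
    """
    known = [height for height in tips.values() if height is not None]
    if not known:
        return []                        # nothing to compare against this round
    max_tip = max(known)
    restart = []
    for name, height in tips.items():
        if height is not None and height >= max_tip:
            behind_counts[name] = 0      # caught up: reset the counter
            continue
        behind_counts[name] = behind_counts.get(name, 0) + 1
        if behind_counts[name] > max_checks_behind:
            restart.append(name)         # still behind after the allowed checks
            behind_counts[name] = 0      # start counting afresh after the restart
    return restart

# Simulated checks: the pool node stays behind the two cluster nodes.
counts = {}
for check in range(1, 5):
    result = nodes_to_restart({"pool": 100, "cluster1": 103, "cluster2": 103}, counts)
    print("check", check, "restart:", result)
```

With the default of 3 allowed checks, a node is flagged on the fourth consecutive check on which it is behind.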

By applying this strategy to both the pool node and the bootstrap cluster, you will always have at least a few nodes to bootstrap from that should be at the tip of the blockchain. If the pool node is restarted, the script should pause until the pool node is running again, so the script will not restart the node you are bootstrapping from.
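
The pause could look something like the Python sketch below. The get_tip callable is hypothetical: it stands for whatever your script uses to query the pool node, and it is assumed to return None while the node is still bootstrapping. The poll interval and timeout are arbitrary example values.

```python
import time

def wait_until_running(get_tip, poll_seconds=5.0, timeout_seconds=1800.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Block until get_tip() returns a block height; return False on timeout."""
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if get_tip() is not None:
            return True                  # node answers again: resume monitoring
        sleep(poll_seconds)              # still bootstrapping: leave the cluster alone
    return False

# Fake node for demonstration: two failed polls, then a tip.
answers = iter([None, None, 1234])
print(wait_until_running(lambda: next(answers), poll_seconds=0.01))
```

While this function blocks, the monitoring loop performs no restarts, so the cluster node that the pool is bootstrapping from stays up.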

Restart node when log shows issues

Some errors cause the node to partly crash without fully dying. This can be seen in the node’s log. We have identified two errors that should cause you to immediately restart your node: “Task panicked” and “cannot schedule getting next block”. The script that checks the blockchain tip number should also check each node’s log for these two strings. If a node has logged at least one of them, it should be restarted.
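
A log check along these lines is straightforward to sketch in Python. The two marker strings come from the text above; the sample log lines are invented for the demonstration and are not real node output.

```python
FATAL_MARKERS = ("Task panicked", "cannot schedule getting next block")

def log_needs_restart(log_lines):
    """Return True if any log line contains one of the known fatal markers."""
    return any(marker in line for line in log_lines for marker in FATAL_MARKERS)

# Invented sample lines, only to exercise the function.
healthy = ["INFO block accepted", "INFO connected to peer"]
broken = ["INFO block accepted", "CRIT Task panicked in some task"]

print(log_needs_restart(healthy))
print(log_needs_restart(broken))
```

One design point to keep in mind: the script should only scan lines written since the node's last restart, otherwise an old error in the log would trigger a restart on every check.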

Scripts

Scripts for setting this up in a generic way that will work in most settings will be provided soon. It will take some time to rewrite them for public use, since they are currently very specific to 1PercentPool.

Thank you for running a stake pool and supporting the decentralisation of Cardano!