Read-only Mode Due To Zookeeper Timeouts

Problem: You repeatedly run into read-only mode, and the logs show a pattern of Zookeeper errors like the following:

2020-06-01 14:48:03,107 WARN  [main] zookeeper.ZKUtil: clean znode for master0x0, quorum=localhost:2181, baseZNode=/hbase Unable to get data of znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:354)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:714)
        at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
        at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:141)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
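
To check whether this error really forms a pattern rather than a one-off, the log lines can be tallied by znode. The following is a minimal sketch, not a Tamr tool: the function name count_connection_loss is hypothetical, and the regular expression assumes the exception line has exactly the format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches the exception line from the log excerpt and captures the znode path.
CONNECTION_LOSS = re.compile(
    r"KeeperException\$ConnectionLossException: "
    r"KeeperErrorCode = ConnectionLoss for (\S+)"
)


def count_connection_loss(lines):
    """Return a Counter mapping each znode path to its ConnectionLoss count."""
    counts = Counter()
    for line in lines:
        match = CONNECTION_LOSS.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example input: two lines copied from the excerpt above.
sample = [
    "org.apache.zookeeper.KeeperException$ConnectionLossException: "
    "KeeperErrorCode = ConnectionLoss for /hbase/master",
    "        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)",
]
for znode, count in count_connection_loss(sample).most_common():
    print(count, znode)
```

Any iterable of strings works as input, so an open log file can be passed in place of the sample list.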

Cause: This error occurs when the Zookeeper session times out. This often happens on single-node deployments when large Spark jobs or disk-heavy operations (such as backup or restore) cause high disk contention. A Tamr deployment runs two Zookeeper instances: Tamr uses one, and HBase uses its own. If the above error is occurring, updating the timeout on both is recommended.

Resolution: To resolve the issue, do the following:

  1. Update the Tamr Zookeeper session timeout to 4 minutes. Using the unify-admin.sh utility, set the variable TAMR_ZK_SESSION_TIMEOUT to 240000. The unit for this variable is milliseconds.
  2. Update the HBase Zookeeper session timeout.

In newer versions, update the following environment variables using the unify-admin.sh utility: set TAMR_HBASE_ZK_SESSION_TIMEOUT to 240000 (4 minutes) and TAMR_HBASE_ZK_TICKTIME to 12000. In extreme cases where disk contention is particularly high (often during a backup or restore operation), it may be necessary to increase TAMR_HBASE_ZK_SESSION_TIMEOUT to 600000 (10 minutes) and TAMR_HBASE_ZK_TICKTIME to 30000.
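
Both pairs of values keep the session timeout at exactly 20 times the tick time (240000 / 12000 = 20 and 600000 / 30000 = 20). This matches Zookeeper's default server-side rule: a requested session timeout is clamped to between 2 and 20 ticks. The sketch below illustrates that arithmetic. It assumes the two Tamr variables map directly onto Zookeeper's session timeout and tickTime settings and that the default minimum and maximum apply, neither of which this article confirms. The function names are hypothetical.

```python
import math

# Zookeeper server defaults: minSessionTimeout = 2 * tickTime,
# maxSessionTimeout = 20 * tickTime. Assumed not to be overridden here.
MIN_TICKS = 2
MAX_TICKS = 20


def negotiated_timeout_ms(requested_ms, tick_time_ms):
    """Clamp a requested session timeout to the range the server allows."""
    lower = MIN_TICKS * tick_time_ms
    upper = MAX_TICKS * tick_time_ms
    return max(lower, min(requested_ms, upper))


def min_tick_time_ms(session_timeout_ms):
    """Smallest tick time for which the full requested timeout is granted."""
    return math.ceil(session_timeout_ms / MAX_TICKS)


# The recommended pair: the full 4 minutes fits within 20 ticks.
print(negotiated_timeout_ms(240000, 12000))
# Raising only the timeout: with a 2000 ms tick the server caps it at 20 ticks.
print(negotiated_timeout_ms(240000, 2000))
# Tick time needed for the 10-minute timeout used in extreme cases.
print(min_tick_time_ms(600000))
```

Under these assumptions, raising the session timeout without also raising the tick time has little effect, which is why the two variables are changed together.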

Finally, restart Tamr and Tamr dependencies for the new configurations to take effect.