The best CTDB bugs ever!

Who: Amitay Isaacs, Martin Schwenke (2014, Perth)
When: January 2014

CTDB is the clustered version of the Trivial Database (TDB) and is used to provide clustering support for Samba. The CTDB daemon is highly asynchronous, event-driven, low-latency, single-threaded, non-blocking and just a bit scary. CTDB makes extensive use of the talloc and tevent libraries.

While trying to improve performance, we found lengthy packet queues, extreme CPU usage and nodes running out of memory. This forced us to learn some basic performance analysis, and it wasn't rocket science! Fixing the performance issues required trivial data structure improvements, some code refactoring and even the redesign of subsystems. The resulting performance improvements have been stunning and allow CTDB to support thousands of concurrent SMB connections per node.

On systems with idle CPUs we have seen tevent complain that it has taken a long time, sometimes minutes, to process events that normally take milliseconds. How can this happen? Why has this become our indicator for general system performance issues?

When can a simple "time robustness" test render a cluster useless? What triggered race conditions that caused (apparently) inexplicable hangs? Why would a pointer comparison stop CTDB from shutting down once in a while?

We will take you on an entertaining journey through our "favourite" CTDB bugs and the lessons we have learned. You'll laugh, you'll cry, you'll nod in agreement and shake your head in disbelief. You will be amazed!

Slides: PDF (753.1 kB)