So we've been making use of GlusterFS for a while now and generally it's been great. Gluster allows us to present unused space on a number of networked servers (our HPC worker nodes) as a single disk pool and make it available to researchers. We currently have two scratch areas, each built from its own group of worker nodes: the 200 series join together to make scratch01 and the 400 series join together to make scratch02.
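For anyone unfamiliar with how such a pool is put together: each worker node exports a directory as a "brick" and the bricks are combined into a volume. The sketch below shows the general shape of building scratch01 from 200-series nodes; the hostnames and brick paths are placeholders, not our actual layout.

# Join the 200-series nodes into the trusted pool, then create and start
# a volume from one brick per node (names and paths illustrative only)
gluster peer probe node201
gluster peer probe node202
gluster volume create scratch01 transport tcp node201:/export/brick node202:/export/brick
gluster volume start scratch01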
We learned a while back that it's important to get disk striping correct. When scratch02 was added the stripe count was set to 1, which meant that each file was written whole to an individual worker node; large files therefore filled the worker node file systems one after another, rather than all the worker node disks filling up together but far more slowly. Additionally, in future iterations of HPC clusters we'll ensure that the free space and OS areas are on separate partitions to avoid contention for critical file space.
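For reference, the stripe count is chosen when the volume is created. The sketch below (placeholder node names, not the command we actually ran) shows the difference: with no stripe option each file lands whole on a single brick, while a stripe count spreads each file's blocks across several bricks.

# Distribute only: whole files are placed on individual bricks
gluster volume create scratch02 transport tcp node401:/export/brick node402:/export/brick

# Striped: each file's blocks are spread across the bricks
gluster volume create scratch02 stripe 2 transport tcp node401:/export/brick node402:/export/brick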
Our more recent issue was a bit more esoteric. Gluster is an abstraction of disk space, but there is another layer of abstraction hidden from users known as peering, where the gluster daemons on the worker nodes communicate amongst themselves to advertise availability and resources. The peering of our 200 and 400 series is intermingled, which means that the 200 and 400 series nodes are aware of each other at the peering level even though they never interact. In theory this is not a problem. However, as in all things in life, theory and reality can diverge, and there are a number of lessons we'll be taking with us when we start working on the next iteration of our HPC cluster.
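The trusted pool is easy to inspect from any node; the commands below are the standard ones, using our volume names but otherwise nothing site-specific.

# Every peer (both 200 and 400 series) shows up here, with its connection state
gluster peer status

# Even so, each scratch volume only lists bricks from its own series
gluster volume info scratch01
gluster volume info scratch02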
That being said, scratch02 is available again.
The error can be seen below; it appears in /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:
socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer
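If you run into the same message, a sensible first check (a sketch, assuming the standard service name and log path) is whether glusterd is actually running on each peer and whether any peer is reported as disconnected:

# On each node: is the management daemon up?
service glusterd status

# From any node: does the pool see any peer as disconnected?
gluster peer status | grep -i disconnected

# How widespread is the error in the glusterd log?
grep -c "Transport endpoint is not connected" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log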
Because the 200 and 400 series peers are all part of the same trusted pool, it's difficult to restart only the volumes served from the 400-series bricks. The solution was to unmount all of the volumes and restart the gluster daemon on every peer; after the restart the gluster volumes mounted successfully. We've since implemented daemon monitoring on all cluster nodes.
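Roughly, the recovery and the monitoring we've added look like the sketch below; the mount points, server names and cron check are illustrative rather than our exact configuration.

# On each client: unmount the gluster volumes
umount /scratch01 /scratch02

# On every peer: restart the management daemon
service glusterd restart

# On each client: remount once the peers are back
mount -t glusterfs node201:/scratch01 /scratch01
mount -t glusterfs node401:/scratch02 /scratch02

# Simple cron check on every node: alert if glusterd has died
# or if any peer is reported as disconnected
pgrep glusterd > /dev/null || echo "glusterd not running on $(hostname)"
gluster peer status | grep -qi disconnected && echo "disconnected peer seen from $(hostname)"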