GlusterFS is an open source, distributed file system capable of scaling to several petabytes and handling thousands of clients. The HPC team at UCT had some time to experiment with and implement Gluster. The idea behind Gluster is to aggregate the available space on a set of servers into a pool which can be used to store data of arbitrary size. A general first impression of Gluster was its impressive CLI with very simple commands. From a more technical angle, Gluster is the only distributed file system that I know of which has no requirement for metadata servers, chunk servers or logger servers.
The architecture aggregates its storage into a global namespace. Gluster makes use of the FUSE (Filesystem in Userspace) module, which allows non-privileged users to create their own file systems.
Configuration commenced by creating a distributed striped volume across 8 servers, using TCP transport over a 1 GigE network. The internal disks are mirrored. This is a good thing, as the distributed striped volume itself has no redundancy.
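For reference, a volume of this shape could be created roughly as sketched below. The volume name scratch01, the hostnames node01 to node08, the brick path /export/brick1 and the stripe count of 4 are all assumptions for illustration, not our actual configuration. The script only prints the commands so they can be reviewed before being run on a real cluster.

```shell
#!/bin/sh
# Sketch: build a "gluster volume create" command for a distributed striped volume.
# VOL, the node names, BRICK_PATH and STRIPE are hypothetical placeholders.
# The servers are assumed to already be in the trusted pool (gluster peer probe).
VOL=scratch01
BRICK_PATH=/export/brick1
STRIPE=4            # 8 bricks with stripe 4 gives 2 stripe sets to distribute over

# Collect one brick per server.
BRICKS=""
for i in 1 2 3 4 5 6 7 8; do
    BRICKS="$BRICKS node0$i:$BRICK_PATH"
done

CREATE_CMD="gluster volume create $VOL stripe $STRIPE transport tcp$BRICKS"

# Print rather than execute; run the printed lines on one of the servers.
echo "$CREATE_CMD"
echo "gluster volume start $VOL"
echo "mount -t glusterfs node01:/$VOL /$VOL"
```

With 8 bricks, any stripe count smaller than 8 that divides it evenly should give a distributed striped layout; a stripe count of 8 would give a purely striped volume instead.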
dd if=/dev/zero of=/scratch01/dd-test.dd bs=10235
47286+0 records in
47286+0 records out
483972210 bytes (484 MB) copied, 4.90815 seconds, 98.6 MB/s
dd if=/scratch01/dd-test.dd of=/dev/null
349409+0 records in
349408+0 records out
178896896 bytes (179 MB) copied, 4.99041 seconds, 35.8 MB/s
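Note that the two runs above are not strictly like for like: the write used 10235-byte blocks, while the read used dd's default 512-byte block size (178896896 bytes over 349408 records) and covered less data than was written. A more controlled comparison might look like the sketch below. The 1 MiB block size, the 64 MiB file size and the TARGET_DIR default are assumed values, and the options shown are those of GNU dd.

```shell
#!/bin/sh
# Sketch of a like-for-like dd write/read test.
# TARGET_DIR, BS and COUNT are assumed values; point TARGET_DIR at the FUSE mount.
TARGET_DIR=${TARGET_DIR:-/tmp}
TEST_FILE="$TARGET_DIR/dd-test.dd"
BS=1M
COUNT=64

# Write test: conv=fsync makes dd flush the data before it reports a rate.
dd if=/dev/zero of="$TEST_FILE" bs=$BS count=$COUNT conv=fsync

# Read test with the same block size. For a real benchmark, remount the volume
# or drop the client page cache first, or the read may be served from memory.
dd if="$TEST_FILE" of=/dev/null bs=$BS

FILE_BYTES=$(wc -c < "$TEST_FILE")
echo "test file size: $FILE_BYTES bytes"
rm -f "$TEST_FILE"
```

Using the same block size and file size in both directions makes the two rates easier to compare.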
One reason the read speed is lower than the write speed is a bug in Gluster, which we identified and reported to Gluster Dev support; they in turn have released a fix for it.
The bug surfaced when creating a stripe volume: data written to the FUSE mount was read back in binary format. Using a tool called "od" with the "-x" option, we were able to identify the inconsistencies and reported the bug to Gluster. The fix has been incorporated into the next release, Gluster 3.3. The workaround was to disable the performance.quick-read setting on the volume being accessed.
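The kind of od comparison mentioned above can be sketched as follows. This is not the exact procedure we used: the test pattern, the file names and the MOUNT_DIR default are assumptions. On a healthy mount the two hex dumps should be identical.

```shell
#!/bin/sh
# Sketch: compare what was written with what is read back, using od -x.
# MOUNT_DIR is a placeholder; set it to the FUSE mount under test.
MOUNT_DIR=${MOUNT_DIR:-/tmp}
SRC=$(mktemp)                      # reference copy on local disk
DST="$MOUNT_DIR/od-check.$$"       # copy written through the mount

# Generate a small, known text pattern.
i=0
while [ $i -lt 1000 ]; do
    echo "line $i of the od test pattern"
    i=$((i + 1))
done > "$SRC"

cp "$SRC" "$DST"

# Hex-dump both files and diff the dumps; any difference marks a mismatch.
od -x "$SRC" > "$SRC.hex"
od -x "$DST" > "$SRC.hex.readback"
if diff "$SRC.hex" "$SRC.hex.readback" > /dev/null; then
    RESULT=identical
else
    RESULT=mismatch
fi
echo "read-back check: $RESULT"

rm -f "$SRC" "$DST" "$SRC.hex" "$SRC.hex.readback"
```

Dropping the redirect to /dev/null on the diff line shows the offsets at which the read-back data differs.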
To apply the workaround, run "gluster volume set <vol_name> performance.quick-read off". However, this may be the reason for the slower read speed we see when reading data from the FUSE mount into /dev/null. This option is enabled by default when a volume is created. We will update this blog once the stripe.c fix is in place and the quick-read option has been re-enabled.
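Applying the workaround and checking that it took effect could be scripted along these lines. The volume name scratch01 is a hypothetical placeholder, and the script falls back to printing the commands when the gluster CLI is not installed on the host.

```shell
#!/bin/sh
# Sketch: disable quick-read on a volume and confirm the option took effect.
# VOL is a hypothetical volume name; run this on a server in the trusted pool.
VOL=${VOL:-scratch01}

SET_CMD="gluster volume set $VOL performance.quick-read off"
INFO_CMD="gluster volume info $VOL"

if command -v gluster > /dev/null 2>&1; then
    $SET_CMD
    # Changed options are listed under "Options Reconfigured" in the info output.
    $INFO_CMD | grep quick-read
else
    # No gluster CLI on this host: just show what would be run.
    echo "$SET_CMD"
    echo "$INFO_CMD"
fi
```

Setting the option back to "on" in the same way would re-enable quick-read once the fixed release is installed.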