The University of Cape Town currently maintains a set of core services for SAGrid running on the EMI and gLite middleware distributions. This set of services forms the backbone of SAGrid and enables the submission of computational jobs. When the initial set of core services was deployed, high availability was not regarded as a top priority. During the Africa-Arabia ROC meeting, providing highly available core services became a priority. We sat around a table and came up with a few fresh ideas. One of them was to use anycast routing to provide access to these services.
To start off, the SAGrid site administrators will host the TopBDII in three places: the University of Cape Town, the University of the Free State and the Meraka Institute (CSIR). Our upstream provider, TENET, will provide SAGrid with an anycast address and configure BGP sessions to achieve failover, redundancy and routing of each request to the nearest of the available servers. If a site becomes unavailable, the routing stack will take care of routing clients and servers to the next-nearest server.
Another issue that could occur is that the service on a server fails while the server itself remains reachable from a networking perspective. This is resolved by sending the server a poison pill. We will use a tool called "Monit", which will be set up on each of the core service servers to monitor the local services. Should a service fail during processing, Monit will run through a set of service checks and try to restart the service. Should the restart be unsuccessful, a poison pill will be issued to the server. Each site will follow a similar configuration.
This approach is new: none of the other international grids have adopted it. According to an EGI representative for GRNET, DNS round-robin or site-specific load balancers are used to maintain high availability, but they lack national routing intelligence.
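
The check, restart and poison-pill sequence described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the Monit configuration itself (Monit is driven by its own control file). The names `supervise`, `check`, `restart`, `poison_pill` and `max_restarts` are hypothetical, and the sketch assumes that the poison pill takes the node out of anycast service so that traffic is routed to the next-nearest site; the text does not specify what the pill actually does.

```python
from typing import Callable


def supervise(check: Callable[[], bool],
              restart: Callable[[], None],
              poison_pill: Callable[[], None],
              max_restarts: int = 3) -> str:
    """Run one supervision pass over a local service.

    Returns "healthy" if the service passes its check, "recovered" if a
    restart fixed it, or "withdrawn" if every restart failed and the
    poison pill was issued.
    """
    if check():
        return "healthy"

    # The service failed its check: try a bounded number of restarts.
    for _ in range(max_restarts):
        restart()
        if check():
            return "recovered"

    # Restarts did not help. Assumption: the poison pill removes this
    # node from anycast service so requests go to another site.
    poison_pill()
    return "withdrawn"


if __name__ == "__main__":
    # Hypothetical stand-ins for a service that never comes back up.
    outcome = supervise(
        check=lambda: False,
        restart=lambda: print("restarting service"),
        poison_pill=lambda: print("issuing poison pill"),
        max_restarts=2,
    )
    print(outcome)
```

Passing the three actions in as callables keeps the sketch independent of how a site actually probes the TopBDII, restarts it, or removes itself from service, all of which are site-specific.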