FAULT TOLERANT SCHEDULING STRATEGY FOR COMPUTATIONAL GRID ENVIRONMENT

Author(s): MALARVIZHI NANDAGOPAL | V. RHYMEND UTHARIARAJ

Journal: International Journal of Engineering Science and Technology
ISSN 0975-5462

Volume: 2;
Issue: 9;
Start page: 4361;
Date: 2010;

Keywords: Grid Resource Management | Grid Job Scheduling | Checkpoint Replication | Fault Tolerance.

ABSTRACT
Computational grids have the potential to solve large-scale scientific applications using heterogeneous and geographically distributed resources. In addition to the challenges of managing and scheduling these applications, reliability challenges arise because of the unreliable nature of grid infrastructure. Two problems are critical to the effective utilization of computational resources: scheduling jobs efficiently and providing fault tolerance reliably. This paper addresses both by combining a checkpoint-replication-based fault tolerance mechanism with the Minimum Total Time to Release (MTTR) job scheduling algorithm. The Total Time to Release (TTR) includes the service time of the job, its waiting time in the queue, and the time to transfer input and output data to and from the resource. The MTTR algorithm minimizes the TTR by selecting a computational resource based on job requirements, job characteristics and hardware features of the resources. The fault tolerance mechanism sets job checkpoints based on the resource failure rate. If a resource failure occurs, the job is restarted from its last successful state using a checkpoint file from another grid resource. A critical aspect of automatic recovery is the availability of checkpoint files, and replication is a strategy to increase that availability. A Replica Resource Selection Algorithm (RRSA) is proposed to provide a Checkpoint Replication Service (CRS). The Globus Toolkit is used as the grid middleware to set up a grid environment and evaluate the performance of the proposed approach. The monitoring tools Ganglia and NWS (Network Weather Service) are used to gather hardware and network details, respectively. The experimental results demonstrate that the proposed approach schedules grid jobs effectively in a fault-tolerant manner, thereby reducing the TTR of the jobs submitted to the grid. It also increases the percentage of jobs completed within their specified deadlines, making the grid more trustworthy.
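
The resource selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so the sketch assumes that TTR is the plain sum of service time, queue waiting time and data transfer time. The names `Job`, `Resource`, `total_time_to_release` and `select_resource`, as well as the units, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    length_mi: float   # job length in million instructions (assumed unit)
    input_mb: float    # input data size in megabytes
    output_mb: float   # output data size in megabytes


@dataclass
class Resource:
    name: str
    speed_mips: float      # processing speed in million instructions per second
    queue_wait_s: float    # estimated waiting time in the resource queue, seconds
    bandwidth_mbps: float  # network bandwidth to the resource, megabits per second


def total_time_to_release(job: Job, res: Resource) -> float:
    """Assumed TTR model: service time + queue wait + input/output transfer time."""
    service_s = job.length_mi / res.speed_mips
    # Convert megabytes to megabits before dividing by the bandwidth.
    transfer_s = (job.input_mb + job.output_mb) * 8 / res.bandwidth_mbps
    return service_s + res.queue_wait_s + transfer_s


def select_resource(job: Job, resources: list[Resource]) -> Resource:
    """Pick the resource with the smallest estimated TTR for this job."""
    return min(resources, key=lambda r: total_time_to_release(job, r))


job = Job(length_mi=1000, input_mb=100, output_mb=50)
candidates = [
    Resource("A", speed_mips=100, queue_wait_s=30, bandwidth_mbps=100),
    Resource("B", speed_mips=500, queue_wait_s=60, bandwidth_mbps=1000),
]
best = select_resource(job, candidates)
print(best.name, total_time_to_release(job, best))
```

In this made-up example the slower resource wins because its queue is shorter, which shows why all three components are weighed together rather than processing speed alone. In the setup the abstract describes, the hardware and network figures would come from Ganglia and NWS rather than from hard-coded values.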