Real-time systems are one of the most important applications of computers, both in commercial terms and in terms of social impact. Increasingly, ...

To check whether a given task set is schedulable (i.e., whether all its tasks can meet their deadlines), we follow the approach due to [Joseph and Pandya, 1986]. The worst-case response time for a task occurs when that task is released at the same time as one copy of every higher-priority task in the system (a proof is provided in [Liu and Layland, 1973], but the intuition behind it should be obvious). Let us number tasks in descending order of priority, with T1 being the highest-priority task and Tn the lowest (for some n). Suppose ei is the WCET of task Ti.

**Then, the response time Ri of task Ti is bounded by the smallest solution of:**

$$R_i = e_i + \sum_{j=1}^{i-1} \left\lceil \frac{R_i}{P_j} \right\rceil e_j \qquad (1)$$

where Pj denotes the period of task Tj.

This expression is not hard to derive. T1 is the highest-priority task in the system, and so it is not delayed by any other task. It therefore starts executing the moment it becomes ready to execute and finishes no later than its WCET. A lower-priority task Ti, by contrast, can be preempted: over an interval of length Ri, each higher-priority task Tj is released up to ⌈Ri/Pj⌉ times, and each release adds ej to the work that must complete before Ti can finish.

Equation (1) can be solved iteratively: denote by Ri^(k) the value at the k-th iteration, substitute it into the right-hand side of (1) to obtain Ri^(k+1), and start with Ri^(0) = ei. This iteration will either converge to the solution of (1) or we will, at some point, get a response time that exceeds the task's deadline, which indicates an inability to meet the deadline.

Traditionally, periodic tasks in an RM-scheduled system are numbered in decreasing order of priority, i.e., Ti has higher priority than Tj if i < j. Also, the RM framework has been extended to allow for aperiodic tasks by reserving space for them in the schedule [Lehoczky, et al., 1987].
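The fixed-point iteration for Equation (1) can be sketched as follows; the task parameters here are illustrative, not drawn from the text:

```python
import math

def response_time(i, wcet, period, deadline):
    """Iterate R = e_i + sum_{j<i} ceil(R / P_j) * e_j, with tasks indexed
    from 0 in descending order of priority.  Returns the smallest fixed
    point, or None if the iterate ever exceeds the deadline."""
    r = wcet[i]                      # R^(0) = e_i
    while True:
        r_next = wcet[i] + sum(math.ceil(r / period[j]) * wcet[j]
                               for j in range(i))
        if r_next > deadline[i]:
            return None              # deadline cannot be met
        if r_next == r:
            return r                 # converged
        r = r_next

# Illustrative task set (descending priority): e = (1, 2, 3), P = D = (4, 6, 10)
wcet, period, deadline = [1, 2, 3], [4, 6, 10], [4, 6, 10]
print([response_time(i, wcet, period, deadline) for i in range(3)])
# prints [1, 3, 10]
```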

Now, consider what happens if there is a transient failure. We need to make assumptions about which tasks are affected by the failure. Let us assume that the failure destroys all results associated with the task that is executing at the moment the fault strikes, and nothing else. That is, if a task has been preempted and is in a partially-completed state, the already-done portion of its computation will remain unaffected. This is reasonable if we assume that such tasks have their state stored in memory with sufficient error-correction coding.

**Based on this, we can write an implicit equation for the response time of task Ti as follows:**

$$R_i = e_i + \sum_{j=1}^{i-1} \left\lceil \frac{R_i}{P_j} \right\rceil e_j + \left\lceil \frac{R_i}{\tau_F} \right\rceil \max(e_1, e_2, \cdots, e_i) \qquad (2)$$

where τF is the minimum time between successive transient faults.

By comparing Equations (1) and (2), we see that the only new term on the right-hand side of (2) is ⌈Ri/τF⌉ max(e1, e2, · · ·, ei). This comes about because the fault may strike at a time when any task is executing. If the executing task at the moment of failure is of lower priority than Ti, its interruption makes no difference to when Ti completes. If it is task Ti itself that is hit, it will cause a maximum additional load of ei to be imposed on the system; if it is some higher-priority task, Tj, it will cause an additional load of ej that will be visible to Ti. Such a delaying effect happens no more than once every τF seconds.

In our derivation so far, we have assumed that the transient failure is brief, which is why its duration is not taken into account in our equations. If we want to relax this assumption, we can simply add the transient-recovery time, trec, to our equations. Thus, Equation (2) will be replaced by

$$R_i = e_i + \sum_{j=1}^{i-1} \left\lceil \frac{R_i}{P_j} \right\rceil e_j + \left\lceil \frac{R_i}{\tau_F} \right\rceil \bigl(\max(e_1, e_2, \cdots, e_i) + t_{rec}\bigr) \qquad (3)$$

If, for every task, this worst-case response time is no greater than the task's relative deadline (i.e., its deadline measured from its release time), the task set is fault-tolerant schedulable.
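The fault-tolerant response-time test can be sketched with the same fixed-point iteration, now including the fault term ⌈R/τF⌉(max ej + trec); the task parameters are illustrative:

```python
import math

def ft_response_time(i, wcet, period, deadline, tau_f, t_rec=0):
    """Fixed-point iteration for the fault-tolerant response time:
    R = e_i + sum_{j<i} ceil(R/P_j) e_j + ceil(R/tau_f) (max(e_1..e_i) + t_rec).
    Tasks are indexed from 0 in descending priority order; tau_f is the
    minimum time between transient faults, t_rec the recovery overhead."""
    e_max = max(wcet[: i + 1])       # worst case: the largest job is re-run
    r = wcet[i]
    while True:
        r_next = (wcet[i]
                  + sum(math.ceil(r / period[j]) * wcet[j] for j in range(i))
                  + math.ceil(r / tau_f) * (e_max + t_rec))
        if r_next > deadline[i]:
            return None              # deadline missed under faults
        if r_next == r:
            return r
        r = r_next

# Same illustrative task set, with at most one fault every tau_f = 20 units
wcet, period, deadline = [1, 2, 3], [4, 6, 10], [4, 6, 10]
print([ft_response_time(i, wcet, period, deadline, tau_f=20) for i in range(3)])
# prints [2, 6, None]: the lowest-priority task loses its guarantee under faults
```

Note how a task set that is schedulable fault-free (responses 1, 3, 10 above) can fail the fault-tolerant test once re-execution load is accounted for.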

One other notable result in fault-tolerant schedulability is the scheduling bound derived in [Pandya and Malek, 1998], where it is shown that under assumptions typically made in real-time periodic scheduling (deadlines equal to periods, all tasks periodic and independent), any periodic task set with a utilization no greater than 0.5 can sustain up to one transient failure without missing any deadlines. This is true even under the assumption that not only the executing task but all uncompleted and preempted jobs are lost upon being struck by a fault. Note that this is a sufficient, not a necessary, condition: while every task set that has utilization no greater than 0.5 can tolerate up to one fault, there are several task sets with utilizations exceeding this bound which are also fault-tolerant. Bounds for multiple faults have also been derived: see [Ghosh, et al., 1997].
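Because the bound is a simple utilization test, it takes only a couple of lines to apply; the task sets below are illustrative:

```python
def tolerates_one_fault_by_bound(wcet, period):
    """Sufficient (not necessary) test from [Pandya and Malek, 1998]:
    a periodic, independent task set with deadlines equal to periods and
    utilization at most 0.5 survives one transient fault without misses."""
    return sum(e / p for e, p in zip(wcet, period)) <= 0.5

print(tolerates_one_fault_by_bound([1, 1], [4, 8]))  # U = 0.375 -> True
print(tolerates_one_fault_by_bound([3, 3], [4, 8]))  # U = 1.125 -> False
```

A False result is inconclusive: the task set may still be fault-tolerant, which must then be established by the response-time analysis above.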

## 4.2 Aperiodic Tasks: Online Convolutional Approach

We now present an online convolutional approach, originally presented for aperiodic tasks using the EDF scheduling algorithm [Liberato, et al., 2000]. In contrast with many of the other algorithms surveyed here, this algorithm can deal with multiple failures. "Convolutional" refers to the computational approach used to determine whether the system can sustain schedulability. As for the aperiodic task loading, note that any procedure that works for aperiodic task sets can be extended to periodic task sets by considering every job released over the Least Common Multiple (LCM) of the task periods. Finally, while the algorithm was originally presented on an EDF scheduling base, there is nothing to prevent some other priority rule from being used.

The focus will be on manipulating the unfinished work at any point in time. In the absence of faults, the unfinished-work function jumps when a task arrives, bringing new work, which is then worked off as the processor executes. Denote by W(T, t) the unfinished work at time t due to task set T. If we reckon time in discrete units of one (rather than as a continuous quantity), we can write a recursion for the unfinished work as follows:

$$W(\mathcal{T}, t) = \max\bigl(0,\; W(\mathcal{T}, t-1) - 1\bigr) + \sum_{i:\, r_i = t} e_i$$

where ri is the release time of task Ti; the sum accounts for the work brought by tasks arriving at time t.

Let fi denote the time when Ti completes execution if there are no faults. Now, consider a fault pattern, Φ = (φ1, φ2, · · ·, φn), where φi is the number of faults that strike during the execution of Ti, i = 1, · · ·, n. Rather than conditioning on when these failures actually occur, we "book" them for accounting purposes as if they all arrived at time fi, with each failure of a given task imposing a load equal to the worst-case execution time of that task. Under such a fault pattern, the unfinished-work recursion can be written as:

$$W_{\Phi}(\mathcal{T}, t) = \max\bigl(0,\; W_{\Phi}(\mathcal{T}, t-1) - 1\bigr) + \sum_{i:\, r_i = t} e_i + \sum_{i:\, f_i = t} \varphi_i\, e_i$$
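This booked-fault recursion is easy to simulate; the release and execution times below (e1 = 3, e2 = 7) are inferred from the finishing times fa = 3 and fb = 10 quoted in the worked example later in this section:

```python
def unfinished_work(tasks, finish, fault_pattern, horizon):
    """W_Phi(T, t) for t = 0..horizon, one time unit per step.
    tasks = [(release r_i, wcet e_i)], finish = fault-free finish times f_i,
    fault_pattern = Phi = (phi_1, ..., phi_n): phi_i faults are booked at f_i,
    each adding a re-execution load of e_i."""
    arrivals = [0] * (horizon + 1)
    for (r, e), f, phi in zip(tasks, finish, fault_pattern):
        arrivals[r] += e          # new work when T_i is released
        arrivals[f] += phi * e    # booked fault-recovery load at f_i
    w, trace = 0, []
    for t in range(horizon + 1):
        w = max(0, w - 1) + arrivals[t]   # one unit executed per slot
        trace.append(w)
    return trace

# T1 = (r=0, e=3), T2 = (r=3, e=7), finishing at f = (3, 10) without faults
no_fault = unfinished_work([(0, 3), (3, 7)], [3, 10], (0, 0), 15)
one_fault = unfinished_work([(0, 3), (3, 7)], [3, 10], (1, 0), 15)
# All work clears at t = 10 without faults; a fault on T1 pushes this to t = 13
```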

Define δw(T, t) to be the maximum additional unfinished work at time t due to running task set T and having been struck by w failures. Note that this captures the worst-case unfinished work given w failures: that is, it represents the additional unfinished work if the failures arrive at the most inconvenient times possible.

Key to the algorithm is the observation that the lowest-priority task Tℓ ∈ T meets its deadline under the w-fault case if and only if δw(T, t) = 0 for some t ∈ [fℓ, Dℓ]. For a proof, see [Liberato, et al., 2000]; however, the result should be quite intuitive to the reader.

**This worst-case additional work can be computed recursively. Writing δ_i^w for its value at time fi, we have:**

$$\delta^w_0 = 0; \qquad \delta^0_i = 0; \qquad \delta^w_1 = w\, e_1;$$

$$\delta^w_i = \max\bigl(\delta^w_{i-1} - \mathrm{slack}(f_{i-1}, f_i),\ \delta^{w-1}_i + e_i\bigr), \qquad i > 1$$

where slack(fi−1, fi) is the free time in the fault-free schedule in the interval [fi−1, fi] (time that is available to execute some of the fault-induced additional workload).

The cases for i = 0 and i = 1 should be obvious. The case for i > 1 chooses the greater of two scenarios: the first, in which all w faults affect tasks T1, · · ·, Ti−1, and the other, in which w − 1 faults affect tasks T1, · · ·, Ti and an additional fault strikes Ti.

**From the definition of δw(T, t), we can write the expression:**

$$\delta^w(\mathcal{T}, t) = \max\bigl(0,\ \delta^w_\ell - \mathrm{slack}(f_\ell, t)\bigr), \qquad t \geq f_\ell$$

where Tℓ is the last task to finish in the fault-free schedule.

If Tℓ is the lowest-priority task in T and δw(T, t) = 0 for some t with Rℓ ≤ t ≤ Dℓ, we conclude that task Tℓ will meet its deadline despite the fault pattern Φ.

To check that every task in T can meet its deadline, we evaluate δw(T, t) over each of the task sets T = {T1}, {T1, T2}, · · ·, {T1, T2, · · ·, Tn}, to check for schedulability of T1, T2, · · ·, Tn, respectively, where the tasks are now numbered in ascending order of absolute deadline (and thus descending order of priority with respect to the EDF scheduling algorithm). Pseudocode for the algorithm is provided in Figure 2.

As an example, consider aperiodic tasks that arrive according to Table 2, where ri is the release time and Di is the absolute deadline of task Ti. We want to tolerate up to k = 2 failures.

Let us consider what happens upon the arrival of each task (tasks are assumed to arrive at their release times; the execution time of the online algorithm to check schedulability is assumed to be negligible).

When task T1 arrives at time 0, the task set is T = {T1}. We have fa = 3 and δw=2({T1}) = 2e1 = 6. The slack time over [3, 10] is 7. Since δw=2({T1}) = 6 ≤ 7, the system can run this task successfully even in the face of up to two failures; T1 is therefore guaranteed by the system.

**Figure 2: Pseudocode for the online convolutional schedulability check.**

1. Define Ti as the task set consisting of the i highest-priority tasks.

2. Initialize ℓ to 1. /* ℓ will be the index of the lowest-priority task in the task set under consideration; we will start with T1, consisting of just task T1. */

3. for (j = 1; j <= n; j++) {

   (a) Simulate EDF under the task set Tj and record the task finishing times, f1, · · ·, fj. Define fj+1 = Dℓ. Record the slack over the intervals [fi, fi+1] for i = 1, · · ·, j.

   (b) Renumber the tasks in ascending order of their finishing times.

   (c) Compute δ_i^w for i = 1, · · ·, j and w = 1, · · ·, k.

   (d) if (δ_i^w > slack(fi, fi+1)) for all i = ℓ, · · ·, j, then the lowest-priority task in Tj cannot be feasibly scheduled: reject the incoming task as not schedulable and exit the algorithm.

   (e) Else, if (j == n), the incoming task can be scheduled together with all the previously-guaranteed tasks in the system. Issue a guarantee for the incoming task and exit the algorithm.

   (f) Set ℓ equal to the index of the lowest-priority task in Tj+1.

   }

Now, T2 arrives at time t = 3. It is easy to see that fb = 10. The slack over [10, 15] is 5 (since none of that interval is needed by either T1 or T2).

**We have, with e2 = 7 and slack(fa, fb) = 0:**

$$\delta^1_2 = \max\bigl(\delta^1_1 - \mathrm{slack}(f_a, f_b),\ \delta^0_2 + e_2\bigr) = \max(3 - 0,\ 0 + 7) = 7$$

$$\delta^2_2 = \max\bigl(\delta^2_1 - \mathrm{slack}(f_a, f_b),\ \delta^1_2 + e_2\bigr) = \max(6 - 0,\ 7 + 7) = 14$$

Since δ2({T1, T2}) > slack(10, 15), T2 cannot be guaranteed and is therefore rejected. The guaranteed task set remains {T1}.
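The admission decisions for T1 and T2 can be reproduced with a short script. The execution times e1 = 3 and e2 = 7 are inferred from fa = 3 and fb = 10; the δ computation follows the two cases described in the text (all w faults among the earlier-finishing tasks, or w − 1 faults plus one striking the task itself), and the admission test is condensed to the single comparison of δ against the slack before the final deadline used in the worked example, rather than the full pseudocode of Figure 2:

```python
def simulate_edf(tasks, horizon):
    """Discrete-time EDF simulation.  tasks = [(release, wcet, abs_deadline)].
    Returns (finish_times, busy) with busy[t] True when the CPU is in use."""
    remaining = [e for (_, e, _) in tasks]
    finish = [None] * len(tasks)
    busy = [False] * horizon
    for t in range(horizon):
        ready = [i for i in range(len(tasks))
                 if tasks[i][0] <= t and remaining[i] > 0]
        if not ready:
            continue
        i = min(ready, key=lambda j: tasks[j][2])   # earliest absolute deadline
        remaining[i] -= 1
        busy[t] = True
        if remaining[i] == 0:
            finish[i] = t + 1
    return finish, busy

def slack(busy, a, b):
    """Idle time in [a, b) of the fault-free schedule."""
    return sum(1 for t in range(a, b) if not busy[t])

def admit(tasks, k, horizon=40):
    """Compute delta[i][w] for tasks sorted by fault-free finishing time;
    admit if the extra work for the lowest-priority task fits in the slack
    remaining before its deadline."""
    finish, busy = simulate_edf(tasks, horizon)
    order = sorted(range(len(tasks)), key=lambda i: finish[i])
    f = [finish[i] for i in order]
    e = [tasks[i][1] for i in order]
    n = len(tasks)
    d = [[0] * (k + 1) for _ in range(n + 1)]
    for w in range(1, k + 1):
        for i in range(1, n + 1):
            if i == 1:
                d[i][w] = w * e[0]          # all w faults strike the first task
            else:
                d[i][w] = max(d[i - 1][w] - slack(busy, f[i - 2], f[i - 1]),
                              d[i][w - 1] + e[i - 1])
    D_low = max(t[2] for t in tasks)        # deadline of lowest-priority task
    return d[n][k] <= slack(busy, f[-1], D_low), d[n][k]

ok1, d1 = admit([(0, 3, 10)], k=2)               # delta = 6 <= slack(3,10) = 7
ok2, d2 = admit([(0, 3, 10), (3, 7, 15)], k=2)   # delta = 14 > slack(10,15) = 5
print(ok1, d1, ok2, d2)   # prints: True 6 False 14
```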

At time t = 4, task T3 arrives. In the absence of failures, T3 would finish executing at time fc = 6. The slack over the period [6, 12] is equal to 6, since nothing is scheduled for that time in the no-fault case. **We have, with e3 = 2 and slack(fa, fc) = 1:**

$$\delta^1_2 = \max\bigl(\delta^1_1 - \mathrm{slack}(f_a, f_c),\ \delta^0_2 + e_3\bigr) = \max(3 - 1,\ 0 + 2) = 2$$

$$\delta^2_2 = \max\bigl(\delta^2_1 - \mathrm{slack}(f_a, f_c),\ \delta^1_2 + e_3\bigr) = \max(6 - 1,\ 2 + 2) = 5 \leq 6$$

Hence, T3 can be guaranteed by the system.

At time t = 13, task T4 arrives. By this time, tasks T1 and T3 have left the system; hence we treat T4 as the first task, so that δw=2({T4}) = 2e4, to be compared with the slack between f4 and D4.

It follows from this that T4 cannot be guaranteed to survive two failures. T4 is therefore rejected by the system.

## 4.3 Aperiodic Tasks: Adaptive Fault Tolerance

When tasks arrive in a sporadic and unpredictable manner, one approach is to give each arriving task as high a level of fault-tolerance as possible, consistent with the need to meet the deadlines of all the tasks [Gonzalez, et al., 1997].

Three fault-tolerance approaches were considered in [Gonzalez, et al., 1997]: Triple Modular Redundancy (TMR), Primary-Backup (PB), and Primary-Exception (PE). TMR and PB have been described in Section 2.4. PE is similar to PB, except that instead of initiating a backup task upon the failure of the primary, the system generates an exception. One variation on PB is SCP-TMR, where the primary element consists of a duplex, forming a self-checking pair (SCP).

Self-checking is carried out by having each element of the duplex execute the task and comparing the results generated by them. If they agree, the self-check is passed; if they disagree, a third processor (the backup) is invoked and has the casting vote.

When a task arrives, the scheduler grants it whichever level of fault-tolerance the system can afford, based on what has already been committed to other tasks; a guarantee is then issued to this task for this level of fault-tolerance. Alternatively, the scheduler can postpone committing to a particular level of fault-tolerance until the point when the task actually starts executing. In this case, only a minimal level of fault-tolerance is initially guaranteed to the application; if a higher level is feasible at the point when the task starts executing, it is provided. This late-commitment approach allows the scheduler to potentially take better care of other tasks that arrive after this task did, but before it commenced execution. If a task arrives to find insufficient resources, it cannot be guaranteed by the scheduler.
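The selection step can be illustrated with a toy dispatcher that picks the strongest scheme whose worst-case processing demand still fits before the deadline. The per-scheme copy counts below are a simplifying assumption for illustration, not the cost model of [Gonzalez, et al., 1997]:

```python
# Toy sketch: choose the strongest fault-tolerance scheme that fits in the
# processor time left before the deadline.  The number of task copies each
# scheme must budget for is an assumed, simplified cost model.
SCHEMES = [("TMR", 3), ("PB", 2), ("PE", 1)]   # strongest first

def choose_scheme(wcet, start, deadline, committed):
    """committed = processor time already promised to other tasks within
    [start, deadline).  Returns the chosen scheme name, or None (reject)."""
    available = (deadline - start) - committed
    for name, copies in SCHEMES:
        if copies * wcet <= available:
            return name
    return None

print(choose_scheme(wcet=2, start=0, deadline=10, committed=2))  # TMR: 6 <= 8
print(choose_scheme(wcet=3, start=0, deadline=10, committed=3))  # PB: 6 <= 7
print(choose_scheme(wcet=5, start=0, deadline=6, committed=2))   # None: 5 > 4
```

Under late commitment, this choice would simply be re-evaluated at dispatch time with the then-current value of `committed`.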

Two metrics were used to characterize this algorithm: the guarantee ratio, which is the fraction of arriving tasks that are guaranteed execution by the scheduler, and the average replication factor, which is the ratio of the total number of task copies to the number of task arrivals.

The greater the guarantee ratio, the more tasks the system can commit to running; the greater the replication factor, the greater the expected fault-tolerance.

Detailed simulation results for this algorithm are provided in [Gonzalez, et al., 1997]. They indicate that adaptive techniques are generally significantly better than static approaches, where the fault-tolerance level is fixed a priori.

# 5 Software Faults

We now look at an approach that seeks to deal with software faults [Han, et al., 2003]. There is a single processor, and the algorithm is not required to protect against hardware faults. In practice, this situation may arise when some other redundancy technique (e.g., TMR) is used to protect against hardware failures.

Software fails when it is given a set of inputs for which it produces an incorrect output. The fault is in the software itself; the underlying hardware is assumed fault-free. As before, we assume that there is some effective mechanism to detect when an output is erroneous.

To survive software failure, we use alternative versions. Unlike in the cases considered earlier, where we were trying to protect against hardware faults, these backup versions cannot be copies of the same software: if they were, they would all fail on the same set of inputs, and multiple copies would be useless. Equally, note that the hardware fault-tolerance approaches mentioned earlier would also work for software failures so long as the re-executed versions were different from the one that failed for that input.
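The use of alternative versions can be sketched as a recovery-block loop: run each distinct version in turn until one passes an acceptance test. The two square-root routines and the acceptance test below are placeholders, not part of the scheme in [Han, et al., 2003]:

```python
def run_with_alternates(versions, acceptance_test, x):
    """Try distinct software versions in order; return the first output that
    passes the acceptance test.  Re-running the *same* version would fail
    again on the same input, which is why the versions must be distinct."""
    for version in versions:
        try:
            y = version(x)
        except Exception:
            continue                 # a crash counts as a detected failure
        if acceptance_test(x, y):
            return y
    raise RuntimeError("all versions failed the acceptance test")

# Placeholder example: two independently written square-root routines
primary = lambda x: x ** 0.5
def alternate(x):                    # Newton-iteration variant
    g = x or 1.0
    for _ in range(50):
        g = 0.5 * (g + x / g)
    return g

accept = lambda x, y: abs(y * y - x) < 1e-6
print(run_with_alternates([primary, alternate], accept, 2.0))  # ~ 1.41421
```

If the primary's output fails the acceptance test (or the primary crashes), the loop falls through to the alternate, mirroring the backup-version idea described above.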