The operating environment affects the charged particle flux. On the ground, in a benign environment, the most significant source of alpha particles is probably the small amount of radioactivity that is naturally present in the casing of the chip. In space applications, the particle flux can be much higher. The sun is the principal source of such particles: the solar wind regularly carries a large flux (the earth's magnetic field shields the ground from most of this). The sun is, of course, a dynamic entity: it often generates solar flares, which may involve the ejection in short order of many tonnes of charged particles. When especially severe solar flares occur and where the earth's magnetic field is not such a strong shield (near the poles), a significant increase in particle flux is seen even on the ground, affecting ground-based systems. For systems in space, the flux is far greater and the rate of transient failures goes up proportionately. Some relief can be found in shielding; however, even with such precautions, failure rates are high [Petersen, 1997].
One final piece of terminology is appropriate to mention here. In fault-tolerance, it is common to distinguish between a fault and an error. A fault is the underlying defect; an error is the external manifestation of that defect. Faults can exist for a long time before generating an error. In some cases, a transient fault may never generate an error. For example, a transient fault occurs in a memory cell when its value is changed; if the cell is updated before the next time it is read, that fault will never have generated an error.
2.4 Fault-Tolerant Structures

Hardware, time, information, and software redundancy are used for fault-tolerance [Koren and Krishna, 2007, Pradhan, 1996, Siewiorek and Swarz, 1999]. Of these types, time and hardware redundancy will be of relevance to this paper.
Of the many structures used in hardware fault-tolerance, two stand out. One is the Triple Modular Redundancy (TMR) structure; the other is the Primary/Backup (PB) approach. In TMR, three processors run redundant copies of the same workload and mask errors by voting on their outputs. Since it requires triplicated hardware, TMR is expensive and used only in the most critical core of fault-tolerant systems.
In the PB approach, a processor executes code and its output is checked for correctness by an acceptance test. One common check is to see if the output is in the expected range. Another is to verify that it satisfies certain conditions: for example, to check that the square root of x has been correctly calculated, one can square the output to ensure that we get back a result close to x. (This is only practical because the squaring operation is of far lower complexity than the square-root operation.) Acceptance tests are not perfect: they may miss incorrect outputs, or falsely flag a correct output as erroneous. If the output fails the acceptance test, the execution of a backup copy of the failed task is initiated.
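As an illustrative sketch (the function names and the tolerance parameter are ours, chosen purely for illustration), the square-root acceptance test and the PB fallback can be written in a few lines of Python:

```python
import math

def sqrt_acceptance_test(x, output, tol=1e-9):
    """Range check plus inverse check: squaring the output is far
    cheaper than recomputing the square root."""
    if output < 0:  # range check: a square root is never negative
        return False
    return abs(output * output - x) <= tol * max(1.0, x)

def run_with_backup(primary, backup, acceptance_test, x):
    """PB sketch: run the primary copy; initiate the backup copy only
    if the acceptance test flags the primary's output."""
    out = primary(x)
    if acceptance_test(x, out):
        return out
    return backup(x)
```

Note that the test is imperfect in exactly the sense described above: too loose a tolerance may pass incorrect outputs, while too tight a tolerance may flag correct ones as erroneous.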
The principal difference between the TMR and PB approaches is that TMR entails forward error masking: multiple copies of the task are always executed. In PB, a backup copy is only executed if the acceptance test flags a failure. In principle, there is no limit to the number of backup copies that can be provided in order to respond to multiple failures of the same task execution.
It is worth pointing out that in the PB approach, the backup copies are not always clones of the original task. Backups may instead be lighter versions of the original, producing output of lower (but still acceptable) quality. The justification for this is that the backups are only invoked occasionally; that backups which take less time impose lighter constraints on the scheduling algorithm; and that a more rudimentary implementation tends to be more reliable.
Fault-tolerant scheduling techniques generally assume the PB approach; if we use TMR, we have forward error masking and do not need to schedule backup copies.
The second type of redundancy that we need to consider is time redundancy. This is the slack that exists in a schedule between when tasks finish and their respective deadlines; this slack (during which a processor would normally be idle) can be exploited to provide a reserve of time during which to execute backup copies as necessary.
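This use of slack can be made concrete with a small sketch (the schedule and task names below are hypothetical): given each task's finish time and deadline, the slack is the interval in which a backup could run.

```python
def slack(schedule):
    """Map each task to its slack: the time between its finishing time
    and its deadline, during which the processor would normally be idle
    and a backup copy could be accommodated (interference from other
    tasks is ignored here for simplicity)."""
    return {task: deadline - finish
            for task, (start, finish, deadline) in schedule.items()}

# Hypothetical single-processor schedule: (start, finish, deadline)
sched = {"T1": (0, 3, 10), "T2": (3, 7, 12)}
```

In this example, a backup of T2 with execution time c fits in the reserve only if c does not exceed 5.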
A good example of fault-tolerant structures is the Time-Triggered Architecture [Kopetz, 1997, Kopetz and Bauer, 2003]. Events are scheduled in advance and an a priori communication schedule can therefore be established. Fault-tolerance is applied hierarchically. The system consists of a number of nodes, interconnected by means of a communication network. The network is redundant for fault-tolerance; for example, if a bus structure is used, multiple buses are provided. The nodes are connected to the network through a communications interface and their communication profile is monitored by means of guardians. These guardians look for evidence of malfunction and disconnect the node from the network as needed.
Each node is a fault-tolerant unit, typically consisting of replicated processors and other hardware to ensure fault-masking [Kopetz and Millinger, 1999]. The clocks are synchronized by an algorithm that can tolerate Byzantine faults [Kopetz and Ochsenreiter, 1987]. As long as individual unit failures do not overwhelm the fault-management capabilities of the node, they will be invisible to the rest of the system.
Replica determinism [Poledna, et al., 2000] is maintained hierarchically [Kopetz and Bauer, 2003]. The fault-tolerance at the nodes ensures this at the interface between the node and the network. The Time Triggered Protocol ensures replica determinism across the network. As mentioned earlier, this supports transmissions that are scheduled a priori: messages are pulled off the interface queue and transmitted according to this schedule. The identity of a message can be discovered by reference to the very same schedule. This clearly requires effective clock synchronization across the network, which, as mentioned above, can be achieved by an algorithm that tolerates Byzantine faults [Kopetz and Ochsenreiter, 1987]. Since even the best synchronization mechanism cannot guarantee zero clock skew, each transmission time slot is flanked on each side by a brief interval of silence. Clearly, as long as the intervals of silence are longer than the maximum clock skew, all transmissions will be guaranteed to be seen by each receiver as occurring in the same clock tick.
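The guard-interval condition can be sketched as follows (the slot layout and parameter names are our own simplification of a TDMA round, not the actual Time-Triggered Protocol frame format):

```python
def receivers_agree(silence, max_skew):
    """All receivers see a transmission as occurring in the same clock
    tick as long as the silence flanking each slot exceeds the maximum
    clock skew between any two nodes."""
    return silence > max_skew

def slot_start(i, slot_length, silence):
    """Start of the i-th transmission window on the shared bus: each
    slot occupies its window plus the silence on either side."""
    return i * (slot_length + 2 * silence) + silence
```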
When a priori communication scheduling is used, enough slack must be provided in the schedule so that the system has the time to deal with a failure and still meet its communication schedule. Another option is to provide the system with alternative communication schedules which it can invoke whenever a fault has to be dealt with [Pop, et al., 2007].
2.5 Voltage Scaling

The time taken to execute a task depends on the clock rate of the processor. The clock rate, in turn, is constrained by the circuit delays. It is well known that circuit delays are inversely proportional to the supply voltage. As a result, by changing the supply voltage (within certain bounds), we can change the execution time of a task. The price we pay for this is power consumption, which is proportional to the square of the supply voltage and linearly proportional to the clock frequency; in the aggregate, if we scale the frequency up as allowed by the lower circuit delays that follow an increase in the voltage, the power consumption scales as the cube of the supply voltage. In real-time systems, the focus has been on how to reduce the supply voltage in order to minimize consumption while still meeting critical task deadlines [Unsal and Koren, 2003]; in fault-tolerant scheduling, briefly raising the supply voltage (within the bounds tolerable by the processor) may give the processor the boost of speed it needs to complete the backup. Keep in mind that failure is usually a comparatively rare event: occasionally spending additional energy to tide over a rare event is usually acceptable, even in power- or energy-constrained systems.
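The cubic relationship can be checked against the usual simplified CMOS model (P = C·V²·f, with f scaled linearly in V; the constants below are illustrative, not device data):

```python
def dynamic_power(v, f, c=1.0):
    """Simplified CMOS dynamic-power model: P = C * V^2 * f."""
    return c * v * v * f

def scaled_power(v, v_nom=1.0, f_nom=1.0, c=1.0):
    """Scale frequency linearly with supply voltage (f proportional
    to V); power then grows as the cube of the supply voltage."""
    return dynamic_power(v, f_nom * (v / v_nom), c)
```

Under this model, raising the voltage by 20% costs roughly 1.2³ ≈ 1.73 times the nominal power, a price paid only while the backup is running.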
3 Basics of Fault-Tolerant Scheduling
At their foundation, all fault-tolerant scheduling procedures are similar. They aim to use time or processor redundancy to ensure that, despite the (permanent or transient) failure of a processor, the hard deadlines of critical real-time tasks will continue to be met.
The fault-tolerant scheduling approach we describe here is meant to provide the system's first, emergency, response to a processor failure. When a transient failure occurs, such an emergency response may be all that is needed: the failed processor may soon recover and resume its duties. If the processor failure is permanent, then if the scheduling is online, with tasks being either guaranteed or rejected as they come in, the system will take account of the reduced computational resources in guaranteeing any future task arrivals. If the failure is permanent and the scheduling is offline, then the system should generate another offline schedule (complete with backup tasks), targeted at the system with one fewer processor. Fault-tolerant scheduling algorithms provide the system with a means to continue meeting deadlines until this can be completed; their role is therefore to provide the system with some breathing room before it makes any longer-term adjustments to the task assignment and scheduling that may be appropriate.
While there are some variations from one approach to another, the general method of responding to a failure is as follows:
• Transient Failures: If the system is designed only to withstand transients that go away quickly, reexecution of the failed task or of a shorter, more basic, version of that task is carried out. The scheduling problem reduces to ensuring that there is always enough time to carry out such an execution before the deadline.
• Software Failure: Here, the failure is that of the software, not of the processor. Software diversity is used: backup software which is different from the failed software is invoked. Again, we have to make sure, in preparing for software faults, that there is enough time for the backup version to meet the original task deadline.
• Permanent Failure: Backup versions of the tasks assigned to the failed processor must be invoked. The steps are:
– Provide each task with a backup copy.
– Place the backups in the schedule, either prior to operation for offline scheduling or before guaranteeing the task for online scheduling.
– If a processor fails, activate one backup for each of the tasks that have been affected.
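The steps above can be sketched as follows (the task-to-processor assignment and backup names are hypothetical):

```python
def activate_backups(assignment, backups, failed_proc):
    """On a permanent failure of failed_proc, activate the backup copy
    of every task whose primary was assigned to that processor."""
    return sorted(backups[task]
                  for task, proc in assignment.items()
                  if proc == failed_proc)

# Hypothetical two-processor system: primaries of T1 and T2 run on
# processor 0, the primary of T3 on processor 1.
assignment = {"T1": 0, "T2": 0, "T3": 1}
backups = {"T1": "B1", "T2": "B2", "T3": "B3"}
```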
In the rest of this paper, we denote by πi and Bi the primary and backup copies, respectively, of task Ti.
There are three general principles with respect to scheduling backups when permanent failures may occur:

– Principle A1: The primary and backup copies of a task must be assigned to different processors. This holds when we are trying to protect against permanent processor failures or transients of significant duration. For approaches whose sole aim is to protect against brief transient failures or software bugs, this does not apply.
– Principle A2: The backup copies are conditionally transparent in the schedule [Krishna and Shin, 1986]; i.e., they can be overloaded [Ghosh, et al., 1997]. By this we mean that, unlike primary copies, which cannot be allowed to overlap other tasks in their schedule (after all, the processor can only execute one thing at a time; note that for multicore chips, each of the cores counts as a processor), we can overlap backups in the schedule so long as their primary copies have not been scheduled on the same processor. This is because all we are trying to protect against is a single failure. (This is like having just one spare tyre in your car: you don’t know ahead of time which one of the four “primary” tyres is going to fail, but just one spare will be enough to replace any one of them.) Figure 1 illustrates this for a three-task system.
Backups B1, B2 whose primary copies have been scheduled on different processors cannot both need to be activated to react to the same processor failure.
– Principle A3: A backup copy may not overlap any primary copy in the schedule. The reason is simple: primary copies always execute, so any time occupied by a primary can never be guaranteed to a backup.
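The placement rules above reduce to simple checks; a minimal sketch follows (task and processor identifiers are hypothetical):

```python
def backup_placement_ok(primary_proc, backup_proc, task):
    """A task's backup must sit on a different processor from its
    primary when guarding against permanent processor failures."""
    return backup_proc[task] != primary_proc[task]

def backups_may_overlap(primary_proc, task_a, task_b):
    """Two backups can be overloaded into the same time slot only if
    their primaries run on different processors, so that a single
    processor failure never requires both backups at once."""
    return primary_proc[task_a] != primary_proc[task_b]
```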
Throughout, unless otherwise stated, we assume that when processors fail, this is detected.
In particular, it is detected by the time any task which is currently running on that processor is scheduled to complete its execution. Further, we assume that, in a multiprocessor, one processor failure cannot induce another. Failure of the communication network is not considered, since the focus here is on rapid recovery from processor failures. Network failures can be guarded against by providing multiple paths connecting any pair of processors.
4 Transient Hardware Failure
4.1 Periodic Tasks: Static Approach

The failure model we start with is that of brief transients affecting a single processor running a workload comprising periodic tasks [Burns, et al., 1996]. It is assumed that transient failures do not occur more than once every τF seconds. These tasks are independent of one another and have no precedence constraints, i.e., one does not provide input to another.
The Rate Monotonic (RM) scheduling policy is used [Liu and Layland, 1973]. RM is perhaps the best-studied of all real-time scheduling algorithms; it is a preemptive algorithm meant for a workload comprising only periodic tasks, where the priority of a task is inversely related to its period. That is, if we have two tasks T1, T2 with periods P1, P2 respectively such that P1 < P2, then T1 has higher priority than T2. If two tasks have identical periods, then the tie can be broken arbitrarily. Since RM is preemptive, a task only executes when there is no higher-priority task waiting to execute. It is assumed that the cost of preemption is negligible. To begin with, we will assume that transients are of negligible duration, i.e., a processor crashes and comes back up again very quickly. Later, we will relax this assumption.
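The RM priority rule, together with the sufficient utilization bound from [Liu and Layland, 1973], can be sketched as follows (the task parameters are illustrative):

```python
def rm_priority_order(periods):
    """Rate Monotonic: a shorter period means higher priority; ties
    are broken arbitrarily (here, by task name)."""
    return sorted(periods, key=lambda t: (periods[t], t))

def liu_layland_schedulable(tasks):
    """Sufficient (not necessary) RM test: n periodic tasks, each with
    execution time c and period p, are schedulable if their total
    utilization does not exceed n * (2^(1/n) - 1)."""
    n = len(tasks)
    utilization = sum(c / p for c, p in tasks)
    return utilization <= n * (2 ** (1 / n) - 1)
```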
Under these assumptions, there is just one processor; if a task is aﬀected by failure, it is reexecuted in the form of a backup.