Most people write software code without considering what can go wrong, thinking that if you code everything correctly, then software should never fail. The reality is that software runs on hardware, and everything that touches the real world has the potential to fail. Rightley and Jordan discuss the importance of implementing fault tolerance into systems and share manageable tips to help you get started.

Transcript
Rightley –
Most people write software without considering what can go wrong. You may think that if you’re doing everything correctly, then theoretically, your software should never fail. But in reality, all software runs on hardware, and everything that touches the real world has the possibility of failure. When you’re designing software, things inevitably can go wrong. You can’t assume that anything is always going to work. If your device absolutely cannot fail, you need to consider building fault tolerance into the system. So, I asked Jordan to sit down with me and shed some light on what fault tolerance is and on some best practices, so that you can implement it in your own systems and have the right frame of mind the next time you’re programming a system that can’t fail.

So Jordan, just give us a little background on what fault tolerance is.

Jordan –
Fault tolerance is the ability of a system or application to continue operation after a fault, error, or failure has occurred and has been detected. The functional operation may be degraded a little bit based on the severity of the failure, but unless the severity of the failure requires the device to enter a failsafe state, you can probably recover and either retry or continue operation and keep your system going. Fault tolerance is normally sought after in high-reliability or life-critical systems.

Rightley –
So, if you’re creating a critical system, for example a medical device or some kind of industrial controller for the process automation industry, where do you begin building fault tolerance into the system?

Jordan – 
First, you need to start analyzing what can go wrong. Like Murphy’s first law: if anything can go wrong, it will.

Rightley –
And by that, you’re talking about the hardware, where it touches the outside world.

Jordan –
Yeah, something’s going to fail at some point and you need to be prepared for it. Really, risk management is where this starts. You need to determine what can go wrong in your system, how critical it is, and whether you can fix it or need to fail safe so that no one gets hurt, nothing explodes, or something along those lines.

Rightley –
So when you talk about something going wrong, that can be communications with an outside system, that could be memory sectors going bad on the system that you’re actually running on, or things like that. What do you mean by handling it and how do you recover from it?

Jordan –
The first step is being able to detect the error. In communications, we run CRCs or checksums to make sure the message you’re getting is valid, because it could’ve been corrupted.

Rightley –
So a CRC by the way is what?

Jordan –
Cyclic Redundancy Check. It’s verifying that the bits weren’t modified in transmission, pretty much.

Rightley –
So catching single-bit or few-bit errors.
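As a rough illustration of the CRC check described above, here is a minimal Python sketch. It assumes a made-up frame layout (payload followed by a 4-byte big-endian CRC-32), and the function names `frame_message` and `check_message` are hypothetical; a real protocol would define its own polynomial and framing.

```python
import zlib
from typing import Optional


def frame_message(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload (assumed frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_message(frame: bytes) -> Optional[bytes]:
    """Return the payload if its CRC matches, otherwise None."""
    if len(frame) < 4:
        return None  # too short to even hold a CRC
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != received_crc:
        return None  # the bits changed somewhere in transit
    return payload


frame = frame_message(b"TEMP=50.0")

# Simulate a single bit flipped during transmission.
corrupted = bytearray(frame)
corrupted[2] ^= 0x01

good = check_message(frame)             # the payload comes back
bad = check_message(bytes(corrupted))   # the frame is rejected
```

The receiver only learns that the message is bad, not what it should have been, which is why detection is followed by recording and a retry.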

Jordan –
Correct. Once detected, it needs to be recorded. A lot of times we see systems where error handling is an afterthought. An error happens, and yeah, you get that it’s an error, but where did it happen? Nothing’s recorded, and then when you’re trying to troubleshoot it, you have nothing.

Rightley –
So basically, when the device comes back into the programming house from the field, you have nothing to go on to try and recreate the error.

Jordan –
Now this doesn’t really happen in a system that’s built for fault tolerance, because the risk management and fault handling are put in ahead of time.

Rightley –
You’re anticipating something could go wrong and trying to collect the information so you can figure out what it was.

Jordan –
Yes, making sure you have everything you could possibly need in the event something fails.
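One way to picture "recording everything you could possibly need" is a small fault log that captures where the error happened, what it was, when, and any context. This is only a sketch: `FaultRecord`, `FaultLog`, and the field names are assumptions for illustration, and an embedded system would more likely write to non-volatile storage than to an in-memory buffer.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FaultRecord:
    source: str   # where it happened (module, sensor, bus)
    code: str     # what happened, as a short machine-readable code
    detail: str   # human-readable description
    timestamp: float = field(default_factory=time.time)  # when it happened
    context: dict = field(default_factory=dict)          # anything else useful


class FaultLog:
    """Bounded log: keeps the most recent `capacity` records."""

    def __init__(self, capacity: int = 100):
        self._records = deque(maxlen=capacity)

    def record(self, source: str, code: str, detail: str, **context) -> FaultRecord:
        rec = FaultRecord(source, code, detail, context=context)
        self._records.append(rec)  # the oldest record is dropped when full
        return rec

    def recent(self, n: int = 10) -> list:
        return list(self._records)[-n:]


log = FaultLog(capacity=2)
log.record("temp_sensor_1", "CRC_MISMATCH", "bad frame", attempt=1)
log.record("temp_sensor_1", "CRC_MISMATCH", "bad frame", attempt=2)
last = log.record("valve_3", "TIMEOUT", "no acknowledgement", waited_s=1.0)
```

With a record like this, a device that comes back from the field carries at least the source, the code, and the time of each fault.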

Rightley –
So what’s a thought process, or some basic steps? Do you retry? How do you do that?

Jordan –
Once it’s detected, it should definitely be recorded. Then you take steps to either retry, recover, or fail safe, kind of in that order. A lot of times people just retry, like in message communications. If there’s something you can do to recover, do that: either switch to another sensor that’s in the same area, or use whatever duplication there is in the system. Or, if there’s nothing you can do and it’s critical to the operation of the system, you have to fail safe so no one gets hurt.
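The retry, recover, fail-safe order can be sketched roughly as below. Everything here is hypothetical: `ReadError`, `read_with_fault_tolerance`, and the callables passed in stand for whatever the real system provides, and the default of three attempts is just the common convention, not a recommendation.

```python
from typing import Callable, Optional


class ReadError(Exception):
    """Raised by a (hypothetical) sensor read that fails or returns bad data."""


def read_with_fault_tolerance(
    primary: Callable[[], float],
    backup: Optional[Callable[[], float]],
    fail_safe: Callable[[], None],
    log: Callable[[str], None],
    max_attempts: int = 3,
) -> Optional[float]:
    # 1. Retry: try the primary source a bounded number of times.
    for attempt in range(1, max_attempts + 1):
        try:
            return primary()
        except ReadError as exc:
            log(f"primary read failed (attempt {attempt}): {exc}")

    # 2. Recover: fall back to a redundant source if one exists.
    if backup is not None:
        try:
            value = backup()
            log("recovered using backup source")
            return value
        except ReadError as exc:
            log(f"backup read failed: {exc}")

    # 3. Fail safe: nothing else worked, so put the system in a safe state.
    log("entering fail-safe state")
    fail_safe()
    return None


def always_fails() -> float:
    raise ReadError("no response")


events, tripped = [], []
recovered = read_with_fault_tolerance(
    always_fails, lambda: 42.0, lambda: tripped.append(True), events.append
)
```

Note that every failed step is logged before moving on, so the record exists whichever branch ends up handling the fault.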

Rightley –
So you bring a human in to try and remedy the situation.

Jordan –
You can bring a human in, or a controller knows to shut down. Most of your sensors or devices in this scenario have a failsafe state, or you tell them what the failsafe state is. A heater would probably failsafe off.

Rightley –
Or a valve would failsafe closed.

Jordan –
Depending what it’s used for, yeah.
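A simple way to make failsafe states explicit is a table that maps each device to the state it should be driven to. The sketch below is illustrative only: the device names, the `SafeState` values, and the `set_state` callback are assumptions, and the right state for each device depends on what it is used for (the cooling valve failing open is an invented example of that).

```python
from enum import Enum
from typing import Callable, List


class SafeState(Enum):
    OFF = "off"
    CLOSED = "closed"
    OPEN = "open"


# Hypothetical devices; the safe state depends on the application.
FAIL_SAFE_STATES = {
    "heater": SafeState.OFF,
    "fuel_valve": SafeState.CLOSED,
    "cooling_valve": SafeState.OPEN,
}


def enter_fail_safe(set_state: Callable[[str, SafeState], None]) -> List[str]:
    """Drive every device to its safe state; return devices that could not be set."""
    failed = []
    for device, state in FAIL_SAFE_STATES.items():
        try:
            set_state(device, state)
        except Exception:
            failed.append(device)  # keep going so one bad device does not block the rest
    return failed


calls = []
failed = enter_fail_safe(lambda device, state: calls.append((device, state)))
```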

Rightley –
What’s a common retry pattern?

Jordan –
For some reason, the thing we see the most is 3 retries. There’s no real reason for it.

Rightley –
Other than it’s one more than trying one more time?

Jordan –
Yeah, pretty much. I mean, you can do 5 or 10, but it depends on the situation, the timing, and the criticality. If you’re doing things every 5 seconds and retrying 10 times, that’s 50 seconds. That’s way too long.

Rightley –
Oh right, so if you’re going to failsafe, you don’t want to try for a full minute while some process is going out of control.

Jordan –
Yes, it depends on the timing and on how fast you can retry.

Rightley –
And also, I’d imagine, if you’re trying to close a valve, for instance, and you’re waiting on it to close, don’t retry 3 times in a tenth of a second if it takes the valve a second to actually close. So you have to suit the retries to the application, right?

Jordan –
Yes, everything should be ironed out. If you do the risk management right, you should know some of these timing characteristics and take them into account when you’re building the system.
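The timing argument above is simple arithmetic: the worst-case time spent retrying is the number of retries times the retry interval, and it has to fit inside the time the process can tolerate, while each interval has to be at least as long as the device takes to respond. A small sketch of that check, where `check_retry_plan` and all of its parameters are hypothetical names and the numbers would come from your own risk analysis:

```python
from typing import List, Tuple


def check_retry_plan(
    retries: int,
    interval_s: float,
    deadline_s: float,
    response_time_s: float,
) -> Tuple[float, List[str]]:
    """Return the worst-case retry time and a list of problems with the plan."""
    problems = []
    worst_case_s = retries * interval_s

    # Retrying must not take longer than the process can safely tolerate.
    if worst_case_s > deadline_s:
        problems.append(
            f"retrying takes {worst_case_s:g} s but fail-safe is needed within {deadline_s:g} s"
        )

    # Retrying faster than the device can respond gains nothing.
    if interval_s < response_time_s:
        problems.append(
            f"retry interval {interval_s:g} s is shorter than the {response_time_s:g} s response time"
        )

    return worst_case_s, problems


# 10 retries, 5 seconds apart, when the process must be safe within 10 seconds.
slow_total, slow_problems = check_retry_plan(10, 5.0, 10.0, 1.0)

# 3 retries, 1.5 seconds apart, for a valve that takes 1 second to close.
ok_total, ok_problems = check_retry_plan(3, 1.5, 10.0, 1.0)
```

The first plan is the 50-second case from the conversation and is flagged as too slow; the second fits both constraints.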

Rightley –
Ok, so let’s try and paint a picture for anybody that’s going to be in this situation. If you could, what’s a real-life situation everybody could relate to where failsafe would be important?

Jordan –
Well, one thing everybody probably runs into in their daily life is driving a car at highway speeds. If some critical part of the system fails, do you want the car to come to a dead stop, or do you want it to allow you to get safely off the road? Say your tire blows. The wheel is still round, I guess, so you can still drive it to a safe place. Whether you’re going to do damage or not, it’s going to let you fail safe and get to a safe spot.

Rightley –
When the tire pressure sensor realizes the tire’s blown, it doesn’t shut the car off.

Jordan –
Yeah, it doesn’t lock the brakes up, or anything like that.

Rightley –
Gotcha. So that sounds like kind of a no-duh sort of example in real life. But don’t we see that people don’t always think about these sorts of considerations when they’re programming, say, an industrial controller or something like that?

Jordan –
Correct.

Rightley –
So, what would be an application that we’ve seen, let’s say in an industrial controller, where in almost any situation you should think about building fault tolerance in? Say communications, or being thread safe between operations: what sorts of things should you always build fault tolerance into?

Jordan –
Any time there’s critical communications with a sensor, another controller, or another device, you should definitely be checking the validity of the data and the message, and making sure the values are in range. If you have a temperature sensor that all of a sudden jumps from, say, 50 degrees C to well over 100 or 200 degrees C in less than a second, that’s not normal operation. Temperature doesn’t normally change that fast.

Rightley –
It can’t happen in the real world so you should try again.

Jordan –
Yes, try again and make sure that that was a good reading. If it continues, say 3 times, or depending on the timing, you need to failsafe if it’s critical, like a laser. If a laser’s overheating and all of a sudden it hits way out of range and it could possibly cause damage, you’ve got to shut it down.
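The range and rate-of-change checks described above could look something like the following sketch. The class name `PlausibilityMonitor`, the limits, and the three-strike threshold are all assumptions chosen to mirror the temperature example; real limits would come from the sensor and process characteristics.

```python
class PlausibilityMonitor:
    """Flags readings that are out of range or change faster than is physically plausible."""

    def __init__(self, low: float, high: float, max_rate_per_s: float, max_strikes: int = 3):
        self.low = low
        self.high = high
        self.max_rate_per_s = max_rate_per_s
        self.max_strikes = max_strikes
        self.last_good = None  # (value, time) of the last accepted reading
        self.strikes = 0       # consecutive implausible readings

    def update(self, value: float, t: float) -> str:
        """Return 'ok', 'retry', or 'fail_safe' for a reading taken at time t (seconds)."""
        in_range = self.low <= value <= self.high

        rate_ok = True
        if self.last_good is not None:
            last_value, last_t = self.last_good
            dt = t - last_t
            rate_ok = dt > 0 and abs(value - last_value) / dt <= self.max_rate_per_s

        if in_range and rate_ok:
            self.last_good = (value, t)
            self.strikes = 0
            return "ok"

        # Implausible reading: ask for a retry until the strike limit is reached.
        self.strikes += 1
        return "fail_safe" if self.strikes >= self.max_strikes else "retry"


monitor = PlausibilityMonitor(low=0.0, high=120.0, max_rate_per_s=5.0)
results = [
    monitor.update(50.0, 0.0),   # first reading, in range
    monitor.update(150.0, 0.5),  # jumped 100 degrees in half a second
    monitor.update(51.0, 1.0),   # back to a plausible value
    monitor.update(200.0, 1.5),
    monitor.update(200.0, 2.0),
    monitor.update(200.0, 2.5),  # third bad reading in a row
]
```

A single glitch is retried and forgotten once a good reading arrives, while a bad value that persists for three readings triggers the fail-safe path.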

Rightley –
So, kind of looping back to what we said in the beginning, any time your code touches the outside world you need to think about fault tolerance. Whether that’s reading and writing to memory, which could go bad, or controlling the laser.

Jordan –
Or a hard drive or flash, anything that’s critical to the operation of the system.

Rightley –
So, bringing it back around, the 3 main points to remember about fault tolerance are:

First, design it into the system any time that your code touches the outside world. That could be third-party libraries, external sensors or hardware, onboard resources like RAM, flash, or a hard drive, and especially any time a user may enter input into the system, right?

Second, always think about those 3, retry, recover, and failsafe, in that order. And remember, recovering can include duplication of hardware, or actually even of where you store something in memory. Use those levels of protection accordingly.

Third, make sure whichever way you’re going makes sense for the process. Don’t retry so long that a process could go out of control, but also don’t retry so fast when you could just wait a few moments for a valve to close, and maybe just record that it’s a little bit sticky. So, make the approach suit the actual application.

Thank you everybody for taking the time to watch this. Thank you, Jordan, for taking the time to talk to everybody about this a little bit.

If you find yourself needing some sort of help implementing fault tolerance into a system that you’re working on and you don’t know where to start, drop us a line in the comments or you can always email us at info@psi-software.com. Thanks!
