Reliability Issues

Data communications networks usually exist within an environment that is designed to accommodate people, with consequently little variation in temperature or humidity. Industrial control systems may be required to operate in conditions where extremes of temperature and humidity are common, or where there is frequent or severe electrical interference. In addition to being more rugged, therefore, industrial networks need technology to be applied in a different way. Whilst a delay of a few minutes may be acceptable on a data communications network while backup and recovery systems kick in, for industrial networks this kind of delay is unacceptable. For most industries, the cost per minute of downtime may be thousands of pounds, and downtime avoidance is a major factor in the design of manufacturing systems, with single points of failure seen as something to be avoided at all costs.

Star topologies, which by their very nature introduce a single point of failure into a networking system, are still being used in industrial networks. It is now becoming standard practice, however, to deploy PLCs in such a fashion that each PLC provides localised control of processes or applications rather than having everything controlled from a central location. This also makes it easier for individual controllers to be taken offline while a section of the plant undergoes maintenance, rather than having to close down the entire plant, or significantly large areas of it. Furthermore, the failure of a controller in one section will only affect that particular section.

Most modern industrial networks are designed around the concept of distributed control. A backbone of small Ethernet switches can be built up using optical fibre, which is inherently immune to electrical noise and has a high bandwidth. Comparatively short UTP connections (since UTP is susceptible to noise) are used between the switches and PLCs or other devices. Single points of failure in the fibre links can be reduced by a careful choice of topology. It is fairly common practice to have several cable trays running around a factory installation to segregate data, low voltage power, medium voltage power and so on. Because the data tray is typically well-populated and frequently disturbed, and because fibre is not susceptible to electrical interference, it is becoming common to find fibre alongside medium voltage power, which is usually kept out of harm's way and employs the kind of large radius bends preferred for fibre.

Yes, there are still single points of failure. The risk can be minimised, however, using a variety of redundancy mechanisms. One such mechanism, designed by Hirschmann, is called HIPER ring, and has since become a de facto standard in industrial networks, endorsed by players such as Siemens, Asea Brown Boveri (ABB) and Rockwell, while many competing products have taken their inspiration from it. The basic concept of HIPER ring is to wire all of the small Ethernet switches together in a ring. While the idea of a ring topology is inherently at odds with the way Ethernet works, it can be successfully deployed provided one of the links is inactive, thereby eliminating the possibility of a broadcast storm occurring.

The de-activated link is continuously monitored, however, to ensure that it can function if required to do so. Thus, if one of the other data links in the ring fails, this redundant link can be activated to restore connectivity between all of the switches in the network. The expected time delay between active link failure and the activation of the redundant link is 200 to 300 milliseconds, which is fast enough for most (though not all) industrial applications. Hirschmann are working on reducing the delay to 50 milliseconds. Obviously, because the failover is transparent to the users of the system, a notification of the link failure is also required to ensure that the failed link receives attention immediately.
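
The ring behaviour described above can be illustrated with a small model. The following is only a sketch of the general idea (a ring with one standby link that is brought into service when another link fails), and does not represent Hirschmann's actual protocol. The class name RingNetwork and all of its methods are hypothetical names chosen for this illustration.

```python
from collections import deque


class RingNetwork:
    """Toy model of a ring of switches with one standby link.

    Hypothetical illustration only; not the real HIPER ring protocol.
    Link i joins switch i to switch (i + 1) % n.
    """

    def __init__(self, num_switches):
        self.n = num_switches
        self.failed = set()
        # The last link closes the ring, so it is held in standby.
        self.standby = num_switches - 1
        self.standby_active = False

    def forwarding_links(self):
        """Return the (switch, switch) pairs currently carrying traffic."""
        links = []
        for i in range(self.n):
            if i in self.failed:
                continue
            if i == self.standby and not self.standby_active:
                continue
            links.append((i, (i + 1) % self.n))
        return links

    def has_loop(self):
        """A ring of n switches only forms a loop if all n links forward."""
        return len(self.forwarding_links()) == self.n

    def is_connected(self):
        """Breadth-first search from switch 0 over the forwarding links."""
        neighbours = {i: [] for i in range(self.n)}
        for a, b in self.forwarding_links():
            neighbours[a].append(b)
            neighbours[b].append(a)
        seen = {0}
        queue = deque([0])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return len(seen) == self.n

    def fail_link(self, link):
        """Record a failure, activate the standby link, return an alert."""
        self.failed.add(link)
        if link != self.standby:
            self.standby_active = True
        # Failover is transparent to users, so maintenance must be notified.
        return f"ALERT: link {link} has failed and needs attention"


ring = RingNetwork(5)
print(ring.is_connected(), ring.has_loop())
print(ring.fail_link(2))
print(ring.is_connected(), ring.has_loop())
```

In this model a single link failure leaves every switch reachable without creating a loop, whereas a second failure splits the ring, which is why the alert matters even though users see no interruption.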

The idea is not entirely new, and owes much to the Spanning Tree Algorithm (STA) used first in bridges, and later in switches, in local area networks. The STA was fine for LANs, but only works for a limited number of bridges or switching devices, and involves a relatively long delay (in the order of thirty seconds) in order to reconfigure the network. While the Rapid Spanning Tree (RST) has now evolved, with a response delay of less than one second, the limitation on the number of devices remains an issue, and the reconfiguration itself is not stable for several seconds, during which time loops may occur.
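
The essential outcome of a spanning tree calculation (a loop-free set of forwarding links, with the remaining links blocked) can be sketched as follows. This is a simplified illustration that uses a breadth-first search from the lowest-numbered switch. It is not the real Spanning Tree Algorithm, which elects a root and exchanges messages between bridges, and the function name spanning_tree is a hypothetical one used only here.

```python
from collections import deque


def spanning_tree(links, root=None):
    """Split links into a loop-free forwarding set and a blocked set.

    Simplified sketch: breadth-first search from the root switch.
    Assumes every switch is reachable from the root.
    """
    nodes = sorted({node for link in links for node in link})
    if root is None:
        root = nodes[0]  # stand-in for root election: lowest ID wins

    neighbours = {node: [] for node in nodes}
    for a, b in links:
        neighbours[a].append(b)
        neighbours[b].append(a)

    seen = {root}
    queue = deque([root])
    forwarding = []
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbours[node]):
            if nxt not in seen:
                seen.add(nxt)
                forwarding.append((node, nxt))
                queue.append(nxt)

    # Any link not chosen for the tree would close a loop, so block it.
    chosen = {frozenset(link) for link in forwarding}
    blocked = [link for link in links if frozenset(link) not in chosen]
    return forwarding, blocked


# Four switches wired in a ring: one link must be blocked.
ring_links = [(1, 2), (2, 3), (3, 4), (4, 1)]
forwarding, blocked = spanning_tree(ring_links)
print("forwarding:", forwarding)
print("blocked:", blocked)

# After link (1, 2) fails, recomputing brings the blocked link into use.
forwarding, blocked = spanning_tree([(2, 3), (3, 4), (4, 1)])
print("forwarding:", forwarding)
print("blocked:", blocked)
```

The recalculation after a failure is the step that takes time in a real network, and it is the source of the reconfiguration delays mentioned above.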

An alternative is simply to provide a number of redundant links. A significant amount of redundancy is often desirable, and does allow for cables linking the same two points to be routed along different paths to preclude the possibility of both links becoming unavailable due to the same physical event. Further redundancy can be designed into the system in the form of redundant network devices such as switches, and even redundant control devices such as PLCs. In the final analysis, the more redundancy you design into the system, the more reliable it will be, but the initial costs will also be higher. It is really a case of weighing the cost of potential downtime or damage against the cost of implementing an appropriate level of redundancy to prevent such events from occurring.
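
The trade-off between downtime cost and redundancy cost can be roughed out numerically. The sketch below assumes that redundant links fail independently of one another (which is what routing them along different paths is intended to achieve), so that the combined availability of n parallel links, each with availability a, is 1 - (1 - a)^n. The function names and all of the figures used are hypothetical and are there purely for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel_availability(link_availability, num_links):
    """Availability of num_links parallel links that fail independently."""
    return 1 - (1 - link_availability) ** num_links


def expected_annual_downtime_cost(availability, cost_per_minute):
    """Expected yearly cost of the time the connection is unavailable."""
    return (1 - availability) * MINUTES_PER_YEAR * cost_per_minute


def total_annual_cost(link_availability, num_links,
                      cost_per_minute, annual_cost_per_link):
    """Expected downtime cost plus the cost of owning the links."""
    availability = parallel_availability(link_availability, num_links)
    downtime = expected_annual_downtime_cost(availability, cost_per_minute)
    return downtime + num_links * annual_cost_per_link


# Hypothetical figures: each link is available 99% of the time, downtime
# costs 1000 pounds per minute, and each link costs 5000 pounds per year.
for links in (1, 2, 3):
    cost = total_annual_cost(0.99, links, 1000, 5000)
    print(f"{links} link(s): total expected annual cost {cost:,.0f} pounds")
```

With figures of this kind the second link pays for itself many times over, while each further link buys a smaller improvement, which is the point at which the cost of additional redundancy has to be weighed against the benefit.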