Overview
Welcome to the Resilience Control Systems University Challenge website. Here you can find information about the joint venture between Idaho National Laboratory and research schools in the advancement of resilient controls.
The Coupled Four-Tank System
Publications
“Model Based Control of a Four-Tank System,” E. Gatzke, E. Meadows, C. Wang, F. Doyle
“21 Steps to Improve Cyber Security of SCADA Networks”
“Resilient Control Systems: Next Generation Design Research,” C. Rieger, D. Gertman, M. McQueen
“Deception Used for Cyber Defense of Control Systems,” M. McQueen, W. Boyer
“Human Factors and Data Fusion as Part of Control Systems Resilience,” D. Gertman
A Schematic of the Coupled Four-Tank System
Challenge Overview
The goal of this challenge is to develop a control system design that maintains quantifiable, stable control in spite of threats, including process disturbances, sensor degradation, cyber intrusion, human error, and related interdependencies. But what exactly does that mean?
A control system is implemented to minimize the number of people required to operate an industrial process and to ensure consistency of operation. While this is great for automation, it can cause some challenges, too. For instance, a sensor could start to fail without being recognized and cause improper process operation. In addition, sensor data or control system components could be corrupted by a malicious actor.
Resilient control systems take all of these possibilities into account and work to reduce the effect they have on the output of the system. That way, our engineering systems stay efficient and the benefit is passed on to society.
Resilient control systems are a great way to link cross-disciplinary people together with a singular goal in mind. The likes of Electrical, Computer, Chemical, and Mechanical Engineers can all pool their skills together to make one system that proves to be resilient.
Experimental System
One specific way we are researching resilient control systems is with the four-tank system shown above. That picture is a little complicated, so we will simplify it. The system consists of a few things: water tanks, some pumps, some valves, and a PLC (Programmable Logic Controller) that acts as the brains of the system.
Let’s start with Pump #2 on the right. In our system, this pump is always on. The first valve after the pump throttles the flow of the system. The second valve controls the amount of liquid flowing into Tank #4. The valve setting is based on a level indication, and an automation algorithm such as a Proportional-Integral (PI) controller adjusts it to maintain the liquid level. When the valve is completely open, all of the water flows to Tank #4. When the valve is completely closed, all of the water flows to Tank #1. At settings in between, the water that is not used to fill Tank #4 goes to Tank #1, which gravity drains to Tank #2. This creates a coupled disturbance for the control of level in Tank #2. In similar fashion, flow from Pump #1 is directed to Tanks #2 and #3. This is what we call a coupled system.
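The Pump #2 branch described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the pump flow, tank area, drain coefficient, and PI gains are made-up values, each tank is assumed to drain by gravity at a rate proportional to the square root of its level, and the flow from Pump #1 is left out. The function name `simulate` is hypothetical and does not come from the challenge software.

```python
import math

def simulate(setpoint=0.30, kp=5.0, ki=0.1, steps=6000, dt=0.1):
    """PI control of the Tank #4 level through the split valve after Pump #2.

    Returns the final levels (h1, h2, h4) in metres.
    All numeric values are illustrative assumptions, not plant data.
    """
    pump_flow = 1.0e-4   # m^3/s, constant because the pump is always on (assumed)
    area = 0.01          # m^2, cross-section of every tank (assumed)
    drain = 1.0e-4       # m^2.5/s, gravity-drain coefficient of every tank (assumed)
    h1 = h2 = h4 = 0.0   # tank levels in metres
    integral = 0.0       # accumulated level error for the integral term

    for _ in range(steps):
        # PI law: valve opening from the Tank #4 level error.
        error = setpoint - h4
        raw = kp * error + ki * integral
        valve = min(1.0, max(0.0, raw))   # 0 = fully closed, 1 = fully open
        if valve == raw:
            # Simple anti-windup: integrate only while the valve is not saturated.
            integral += error * dt

        # Gravity drain flows computed from the current levels.
        out1 = drain * math.sqrt(h1)
        out2 = drain * math.sqrt(h2)
        out4 = drain * math.sqrt(h4)

        # Open valve sends water to Tank #4; the remainder goes to Tank #1,
        # which drains into Tank #2 (the coupled disturbance).
        h4 = max(0.0, h4 + dt * (valve * pump_flow - out4) / area)
        h1 = max(0.0, h1 + dt * ((1.0 - valve) * pump_flow - out1) / area)
        h2 = max(0.0, h2 + dt * (out1 - out2) / area)

    return h1, h2, h4
```

Calling `simulate()` with two different setpoints shows the coupling: lowering the Tank #4 setpoint diverts more water to Tank #1 and raises the level that Tank #2 settles at, even though Tank #2 is not the loop being adjusted.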
Access to the PLC data, as well as to the control software if desired, is over a wireless network from a PC running an application such as LabVIEW. Both cyber security resilience and the need to present usable information to the operator complicate this interdisciplinary challenge.
About Resilient Control Systems (RCS)
Industrial control system (ICS) networks control numerous important processes of daily life, from electricity and food production to sewage and water management. The importance of the operations they affect must be reflected in their endurance and ability to recover from abnormal circumstances. Presented below are some areas of ICS that resilience seeks to address, as well as some examples to help illustrate these problems.
Problems to Address
- Unexpected Faults – Within control systems design, abnormal behavior may result when an unexpected or untested disruption occurs within the system. Such disturbances may cause certain mechanisms to act outside of their bounds and threaten the stability of the ICS.
- Complex interdependencies and latency – More control systems are being integrated with IT systems, leading to added complexity in the system. Latency and other negative impacts may be caused by unknown network couplings.
- Human performance prediction – The human operator’s contribution to the resilience of the system cannot be adequately measured by sensors or any other form of definitive measure. True resilience is also a cross-discipline goal, and it will not be accomplished by relying on a single individual who does not have knowledge of all threats to the system.
- Cyber security – ICS will continue to be an attractive target for malicious attackers, be they amateur novices, disgruntled employees, organized crime or even nation states. As such, they need to be protected. However, certain security parameters such as multiple passwords, active scanning, etc., can disrupt or delay ICS processes. In some cases, the result of these activities may be as detrimental as a cyber attack.
- Multiple performance measures – “The performance of control systems is based upon several measures, not just process stability, but also physical and cyber security, process efficiency, and process compliancy” [1]. Each of these measures has its own goal, without any true unified platform to build from.
- Lack of state awareness – In RCS, the multitude of conditions that may threaten an ICS makes it difficult to evaluate the exact state of the system. It must be analyzed from all points of view (physical and cyber security, process efficiency, general errors, etc.) in order to make a judgment about the RCS. If the incorrect state is assumed, the entire system may be at risk.
[1] C. G. Rieger, “Notional Examples and Benchmark Aspects Of a Resilient Control System,” 3rd International Symposium on Resilient Control Systems, August 2010.
Examples
Power Transmission Substation
Scenario
A transmission substation is currently interfaced to a supervisory control and data acquisition (SCADA) system, down to the relay level, for monitoring and control. This substation is part of a large transmission system and has state-of-the-art gear with IEC 61850 interfaces. Communication with each relay is over an Internet Protocol (IP) based network, with standard Information Technology (IT) routing and segmenting equipment. The substation switchgear is protected from the environment by a building and, for physical security, has a locked door and a surrounding fence with a locked gate. Inspection visits to the substation occur infrequently and normally only for maintenance purposes.
During a backshift, a dispatcher receives an indication that one of the relays in a substation has caused a breaker to isolate a critical line, which would reduce capacity and potentially cause a blackout in a section of a neighboring large city. However, the monitoring data from a downstream phasor measurement unit (PMU) indicates that power is still flowing to the city. Within one scan cycle of the substation SCADA system, the breaker status returns to the normal closed position. To be cautious, the dispatcher sends a crew to investigate the associated substation to confirm the status of the breaker. In the course of investigation, the crew finds that the lock on the fence is missing and the door to the substation is unlocked. Had the door been opened, a status alarm should have been indicated back at the control center. Investigation of the substation indicates that the breaker with the suspect condition is still closed and operating correctly. As the substation appears undisturbed, other than the missing fence lock, it is concluded that further investigation is unnecessary.
A week after this incident, a different crew is on the backshift when numerous calls are received that the neighboring city has lost power. The indications in the control center show no problems. Again, a crew is dispatched to the substation that had been inspected the week before. While the physical security of the substation does not appear to be compromised, the crew finds that the breaker previously investigated is now open, as are several others, creating an overload condition on the transmission lines to the city that eventually tripped power. An investigation by security found a foreign wireless communications device, indicating that a back door was created by an individual entering the substation [1].
Analysis
- Unexpected Faults – The facility was physically broken into, and a backdoor was installed into the system.
- Human performance prediction – Door alarms such as the one presented in the above scenario are seen as low priority and are often turned off by operators due to annoying false alarms or apparent lack of need. In this case, the alarm was turned off due to recent maintenance and left off by accident [1].
- Cyber security – There was a lack of specific security measures to prevent the foreign wireless device from connecting. This lack provided the cyber exploit needed to compromise the system.
- Multiple performance measures – Resilience is the result of the ICS being examined from many viewpoints. The dispatcher in this case had no data to suggest this was a cyber attack, and lacked the experience to suspect such a thing may have occurred. Those who inspected the site were likely the same. Without operators being trained to examine multiple possibilities for alarms, resilience cannot be affirmed.
- Lack of state awareness – The central authority was unaware that the building had been entered, due to the lack of an alarm. This was further compounded when the attacker infiltrated the communications between the remote base and the central command building. They then published false readings and hid their activity.
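One way to act on the points above is to cross-check independent indications instead of trusting any single one. The sketch below is a hypothetical illustration, not part of the cited scenario: the `Snapshot` fields, the `consistency_findings` function, and the 1 MW flow threshold are all assumptions, and the PMU is assumed to measure the line protected by the breaker in question.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One set of indications seen at the control center (hypothetical fields)."""
    breaker_closed: bool       # breaker status reported through SCADA
    pmu_power_mw: float        # power measured by the downstream PMU
    door_alarm_enabled: bool   # whether the substation door alarm is active

def consistency_findings(snapshot, min_flow_mw=1.0):
    """Return a list of disagreements that deserve follow-up.

    min_flow_mw is an assumed threshold below which the line counts as dead.
    """
    findings = []
    power_flowing = snapshot.pmu_power_mw > min_flow_mw
    if not snapshot.breaker_closed and power_flowing:
        findings.append("breaker reported open but PMU shows power flowing")
    if snapshot.breaker_closed and not power_flowing:
        findings.append("breaker reported closed but PMU shows no power flow")
    if not snapshot.door_alarm_enabled:
        findings.append("door alarm disabled: physical entry would go unreported")
    return findings
```

In the first incident, the breaker indication and the PMU data disagreed while the door alarm was off, so a check of this kind would have returned two findings. A pattern like that suggests the data path itself may be suspect, not only the breaker.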
[1] C. G. Rieger, “Notional Examples and Benchmark Aspects Of a Resilient Control System,” 3rd International Symposium on Resilient Control Systems, August 2010.
Chemical Facility Reactor
Scenario
A chemical reactor unit operation is automated with a state-of-the-art distributed control system (DCS). The DCS provides multivariable control of the reactor, implemented with an optimal control methodology. The sensors that provide the data for this multivariable design are interfaced to multiple redundant controllers based on proximity to the process equipment, requiring exchange of data with the controller hosting the optimal control design. The communications system is an IP-based design, which interconnects all of the controllers. The DCS is isolated from the business systems via standard information technology (IT) devices, namely firewalls, segmentation, and demilitarized zone (DMZ) protections.
During the operation of this system, a failure of a group of sensors occurs. Depending on the type of failure, this event may or may not be recognized and responded to by current research philosophies. If the sensors fail to normally accepted high or low levels and generate an alarm, the failure will be easily addressed. However, if they fail in a known good state or within the normal range, the limitations of current research philosophies become apparent: they do not address the issues raised by the situation in a holistic manner. This failure could be due to a cyber attack specific to an OLE for Process Control (OPC) server or a wireless access point, or it could be due to software failing in an undesirable or unexpected manner. [1]
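The difference between the two failure types can be illustrated with a short sketch. It is an assumption-laden example, not a method from the cited paper: the function names, the noise threshold, and the use of a median across redundant sensors are all hypothetical choices. A conventional high/low alarm only catches the first failure type, while a check for unnaturally steady readings or for disagreement with redundant sensors can flag a sensor that is stuck inside the normal range.

```python
from statistics import median, pstdev

def range_alarm(value, low, high):
    """Conventional alarm: trips only when the reading leaves [low, high]."""
    return value < low or value > high

def looks_frozen(recent_readings, min_spread):
    """Flag a sensor whose recent readings vary less than the expected noise.

    min_spread is an assumed lower bound on the normal standard deviation.
    """
    return len(recent_readings) >= 2 and pstdev(recent_readings) < min_spread

def disagrees_with_peers(value, peer_values, tolerance):
    """Flag a reading that is far from the median of redundant sensors."""
    return abs(value - median(peer_values)) > tolerance

# A sensor stuck at a plausible value never trips the range alarm...
stuck = [50.0] * 10
range_trips = any(range_alarm(v, low=20.0, high=80.0) for v in stuck)
# ...but its lack of variation and its distance from its peers give it away.
frozen = looks_frozen(stuck, min_spread=0.05)
outlier = disagrees_with_peers(stuck[-1], peer_values=[57.9, 58.3, 58.1], tolerance=2.0)
```

Neither check proves a cause. A frozen or disagreeing reading could come from hardware, software, or an attacker, which is why the analysis below treats the fault and the security questions together.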
Analysis
- Unexpected Faults – The sensors failed, but within an accepted range. This led to the reporting of inaccurate data, and no alarms were triggered.
- Complex interdependencies and latency – The complexity of this system’s interactions was inadequately understood. In this case, the system failure could have been caused by increased latency and incorrect data transferred within the hierarchy structure controlling the sensors.
- Human performance prediction – To prevent this scenario from happening, the operator must detect that the sensors are reporting erroneous readings. If they are unaware, there is the risk that the problem will not be noticed until the error begins to affect other systems. At this point, the operator will be responsible for fixing these added issues and must also be able to track where the problem began to prevent it from reoccurring.
- Cyber security – This type of failure could be produced by an attack such as latency injection, and would be difficult to detect by those not trained in cyber security tactics.
- Lack of state awareness – Without a mechanism to measure the priority of failed sensors, state awareness cannot be maintained. If the failure is unknown to the operator, other events may be deemed to have higher priority without full knowledge of the system.
[1] C. G. Rieger, “Notional Examples and Benchmark Aspects Of a Resilient Control System,” 3rd International Symposium on Resilient Control Systems, August 2010.
Heating, Ventilation, and Air Conditioning (HVAC) System for a Hazardous Facility
Scenario
A facility that is producing hazardous substances has an advanced HVAC system for regulating pressures within the building. By maintaining pressures so that the most hazardous areas are at the lowest pressure and normally occupied spaces are at the highest, the migration of hazardous substances can be prevented. The system design also uses supervisory control, in that a neural network design implements night-time setbacks that increase the air conditioning set points to reduce overall energy usage. The primary temperature and differential pressure control on the system is through some form of PID algorithm, with each hazardous zone having its own controller and separate temperature controllers for the hazardous and occupied zones. Intake and discharge ducting and blowers are common to the hazardous and occupied spaces, but each area has an individual header. In the case of the hazardous areas, high efficiency filters are used to remove pollutants.
During the morning before the workers arrive, the exhaust airflow from the facility gets largely blocked due to an abnormal failure of a damper. This creates a back pressure on both the hazardous and occupied zones of the facility. In response to the reduced airflow, the inlet damper of each hazardous zone closes to maintain the required differential pressure. However, minimum facility flows are not maintained and the damper controls are not able to equalize consistently, allowing periods where potential migration of hazardous species may occur. In addition, the drop in airflow prevents cooling and allows the temperature to increase in both the occupied and hazardous areas. As the airflow through the air conditioning coils has dropped, the PID controller continues to increase the amount of coolant to the coils until they freeze—which the freeze protection switch, or freeze stat, fails to prevent due to improper positioning. The regime that the facility is now operating within has also gone outside of the training for the neural network, but as the occupied period is reached, the neural network decreases the temperature set points without regard to the abnormal situation. [1]
Analysis
- Unexpected Faults – The damper failed, which led to increased back pressure. The supervisory neural network was trained for particular conditions, and could not react properly when this abnormal situation occurred.
- Human performance prediction – Here, the scenario does not include a human operator, but the human intent behind the design of the system is apparent. Those who are responsible for the creation and implementation of this system did not adequately account for unusual circumstances, and the migration of hazardous substances is a result.
- Multiple performance measures – Both temperature and differential pressure must be maintained by this HVAC system. In this system, there is no supervisory mechanism that enforces a priority among these goals. This scenario would indicate that pressure should be the higher goal, due to its ability to stop the migration of hazardous materials. In order to prioritize these goals, however, proper state awareness must be maintained.
- Lack of state awareness – This system is not aware when the damper fails, and thus is unable to maintain complete state awareness. If a deviation from status quo had been noted by some mechanism, it is possible that the system could have reacted accordingly by prioritizing goals, and providing a path for graceful degradation.
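The kind of supervisory mechanism the analysis finds missing can be sketched as a simple priority rule. Everything here is a hypothetical illustration: the function `supervise`, its parameters, and the sign convention (differential pressure measured as occupied-zone pressure minus hazardous-zone pressure, so larger is safer) are assumptions and are not taken from the cited scenario.

```python
def supervise(dp_pa, dp_min_pa, airflow_m3s, trained_airflow_range,
              current_setpoint_c, requested_setpoint_c):
    """Decide whether to accept a temperature set point change.

    dp_pa: measured differential pressure (occupied minus hazardous), in Pa.
    dp_min_pa: assumed minimum differential pressure needed for containment.
    trained_airflow_range: (low, high) airflow the neural network was trained on.
    """
    low, high = trained_airflow_range
    containment_ok = dp_pa >= dp_min_pa
    within_training = low <= airflow_m3s <= high

    if containment_ok and within_training:
        # Normal operation: energy-saving set point changes are allowed.
        return {"mode": "normal", "setpoint_c": requested_setpoint_c, "alarm": False}

    # Degraded operation: containment outranks temperature, so hold the
    # current set point and raise an alarm instead of following the request.
    return {"mode": "containment_priority", "setpoint_c": current_setpoint_c, "alarm": True}
```

With a rule like this, the blocked exhaust damper would show up as low airflow or lost differential pressure, and the morning set point decrease would be held back rather than applied without regard to the abnormal state. This is one possible path toward the graceful degradation mentioned above.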
[1] C. G. Rieger, “Notional Examples and Benchmark Aspects Of a Resilient Control System,” 3rd International Symposium on Resilient Control Systems, August 2010.