Runtime memory faults during production runs deserve more thorough treatment because they severely affect system availability. This paper proposes a method for mitigating memory faults during production runs of deployed software, thereby keeping the system operating normally until patches that fix the faults are delivered. The method also improves debugging efficiency by providing accurate on-site fault information that developers can use to release timely patches. At its core, the method combines information tagging, which identifies runtime faults, with a fault survival algorithm, which applies differentiated fault mitigation according to the runtime state. We implemented RemOte runtime Protection for High-risk Error (ROPHE) on a Linux 2.6 platform and conducted an empirical study of representative Linux applications. The results show that the applications' own fault-handling rate averages 35.75%, whereas ROPHE raises the fault-survival rate to an average of 91.94%. Specifically, the fault-handling rates of the applications ranged widely, from 7.32% to 62.96%, while ROPHE provided fault-survival rates in the relatively narrow range of 82.35% to 97.44%. These results indicate that the proposed method delivers a comparable level of reliability for all applications regardless of their individual fault-handling capacity.
- deployed software reliability
- fault mitigation
- fault survival
- runtime memory fault