Summary of the PCI Bus EEH Error Recovery Feature on Power Systems
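The EEH feature supported on Power Systems keeps a PCI bus error from affecting the whole system or triggering a machine check. EEH prevents system hangs caused by PCI errors and lets individual PCI adapters be rebooted and recovered on their own. It also guards against the user and kernel data corruption, adapter hangs, and system crashes that PCI errors can otherwise cause. The main causes of EEH events are poorly seated cards in a slot, device driver or firmware bugs, and PCI card hardware bugs. When a bug makes a PCI adapter attempt DMA to memory that has not been reserved for it, EEH protects the system. Most hardware-related EEH errors come from poor contact, so reseating the adapter usually resolves them.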
- Isolate PCI errors within IO domains without affecting the rest of the system
- RTAS (Run-Time Abstraction Services) in the firmware provides PCI-error-related services to the OS
- The EEH core in the OS handles the error either by taking appropriate corrective actions or by resetting the IO domain responsible for the error
- Partitionable Endpoint (PE): an I/O error and recovery domain made up of a single I/O adapter (IOA) or multiple IOAs
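
The points above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how the firmware or the kernel EEH core is implemented: the names `PartitionableEndpoint`, `PEState`, `dma`, and `recover`, as well as the `max_resets` limit, are hypothetical and chosen for illustration. It shows the idea that a DMA outside the range reserved for an adapter freezes only that adapter's PE, while other PEs keep working and the frozen PE can be reset on its own.

```python
from dataclasses import dataclass
from enum import Enum


class PEState(Enum):
    NORMAL = "normal"
    FROZEN = "frozen"  # I/O for this PE is blocked until it is recovered


@dataclass
class PartitionableEndpoint:
    """Toy model of a PE: one error/recovery domain with its own DMA windows."""
    name: str
    dma_windows: list  # (start, size) ranges reserved for this PE (illustrative)
    state: PEState = PEState.NORMAL
    resets: int = 0

    def dma_allowed(self, addr: int, length: int) -> bool:
        """True if [addr, addr + length) lies fully inside one reserved window."""
        return any(start <= addr and addr + length <= start + size
                   for start, size in self.dma_windows)

    def dma(self, addr: int, length: int) -> bool:
        """Attempt a DMA. A bad access freezes only this PE instead of
        silently corrupting memory; a frozen PE rejects all further DMA."""
        if self.state is PEState.FROZEN:
            return False
        if not self.dma_allowed(addr, length):
            self.state = PEState.FROZEN
            return False
        return True


def recover(pe: PartitionableEndpoint, max_resets: int = 5) -> bool:
    """Hypothetical recovery policy: reset the PE unless it keeps failing."""
    if pe.state is not PEState.FROZEN:
        return True
    if pe.resets >= max_resets:
        return False  # give up and leave the PE isolated
    pe.resets += 1
    pe.state = PEState.NORMAL
    return True


if __name__ == "__main__":
    nic = PartitionableEndpoint("nic", [(0x1000, 0x1000)])
    disk = PartitionableEndpoint("disk", [(0x8000, 0x2000)])
    print(nic.dma(0x1800, 0x100))    # inside the nic's reserved window
    print(nic.dma(0x9000, 0x100))    # outside it: the nic PE is frozen
    print(disk.dma(0x9000, 0x100))   # the disk PE is unaffected
    print(recover(nic), nic.state)   # only the nic PE is reset
```

In this model the bad DMA is rejected and the error stays inside one PE, which mirrors the isolation and per-adapter recovery described above. A real system performs these checks in the PCI host bridge hardware and firmware rather than in OS-level code like this.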
<Reference>
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions. These features
go under the name of "EEH", for "Enhanced Error Handling". The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating system.
This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to data corruption, both of user data or of kernel data,
hung/unresponsive adapters, or system crashes/lockups. Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.
Future systems from other vendors, based on the PCI-E specification,
may contain similar features.
Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards, or,
unfortunately quite commonly, due to device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.
The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card. This is a powerful feature,
as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA. A number of device driver
bugs have been found and fixed in this way over the past few
years. Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity due to a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.
https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt