14 RS/6000 7025 Model F80 Technical Overview
Dynamic CPU Deallocation
The processors are continuously monitored for errors such as L2 cache ECC
errors. When a predefined error threshold is met, an error log with warning
severity and threshold exceeded status is returned to AIX. At the same time, the
service processor marks the CPU for deconfiguration at the next boot. In the
meantime, AIX will attempt to migrate all resources associated with that
processor (tasks, interrupts, etc.) to another processor, and then stop the failing
processor.
Dynamic CPU deallocation is only active in systems with more than two
processors. Stopping a processor on a two-way system would leave a single
processor, and device drivers and kernel extensions that are common to
multi-processor and uni-processor systems would then switch to uni-processor
mode, with unpredictable results.
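The sequence described above can be sketched as follows. This is a minimal illustration, not the AIX or service processor implementation: the `Cpu` class, the `report_recoverable_error` function, and the threshold value of 3 are all hypothetical, and the sketch assumes that "more than two processors" refers to processors currently running.

```python
from dataclasses import dataclass, field

# Hypothetical value: the text only says that a predefined threshold exists.
ERROR_THRESHOLD = 3


@dataclass
class Cpu:
    cpu_id: int
    recoverable_errors: int = 0
    running: bool = True
    deconfigure_at_next_boot: bool = False
    tasks: list = field(default_factory=list)


def report_recoverable_error(cpus, cpu_id):
    """Count one recoverable error and react once the threshold is met.

    Returns True if the processor was stopped at run time.
    """
    cpu = next(c for c in cpus if c.cpu_id == cpu_id)
    cpu.recoverable_errors += 1
    if cpu.recoverable_errors < ERROR_THRESHOLD:
        return False

    # Service processor side: exclude this CPU at the next boot.
    cpu.deconfigure_at_next_boot = True

    # Operating system side: only deallocate when more than two processors
    # are running, so that at least two remain afterwards (assumption).
    running = [c for c in cpus if c.running]
    if not cpu.running or len(running) <= 2:
        return False

    # Migrate the work to another running processor, then stop this one.
    target = next(c for c in running if c is not cpu)
    target.tasks.extend(cpu.tasks)
    cpu.tasks.clear()
    cpu.running = False
    return True
```

On a two-way system the same call still sets `deconfigure_at_next_boot`, but the processor keeps running until the next boot.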
Persistent CPU and Memory Deconfiguration
CPUs and memory modules with a failure history are marked bad to prevent them
from being configured on subsequent boots. This history is kept in the VPD³
records on the FRU⁴, so the information moves physically with the FRU and is
cleared when the FRU is replaced, and stays with the failed FRU when it is
returned to IBM. A CPU or memory module is marked bad when:
• It fails BIST⁵/POST⁶ testing during boot (as determined by the service
  processor).
• It causes a machine check or check stop during runtime and the failure can be
  isolated specifically to that CPU or memory module (as determined by the
  service processor).
• It reaches a threshold of recovered failures (for example, correctable L2
  cache ECC errors, as described under Dynamic CPU Deallocation) that result in
  a predictive call-out (as determined by the service processor).
During CEC initialization, the service processor checks the VPD values and does
not configure CPUs or memory that are marked bad, much in the same way that it
would deconfigure them for BIST/POST failures.
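The persistent deconfiguration logic can be illustrated with a small sketch. The names `Fru`, `record_failure`, and `configure_at_boot` are hypothetical, and the `marked_bad` field merely stands in for the flag kept in the VPD record; the actual VPD layout is not described here.

```python
from dataclasses import dataclass


@dataclass
class Fru:
    """A CPU or memory FRU; marked_bad stands in for the flag in its VPD."""
    name: str
    marked_bad: bool = False


def record_failure(fru, *, failed_self_test=False,
                   isolated_machine_check=False, predictive_callout=False):
    """Mark the FRU bad if any of the three listed conditions applies."""
    if failed_self_test or isolated_machine_check or predictive_callout:
        fru.marked_bad = True


def configure_at_boot(frus):
    """Return the FRUs to configure, skipping any that are marked bad."""
    return [f for f in frus if not f.marked_bad]
```

Because the flag is stored on the FRU object itself, swapping in a new `Fru` instance models a replacement part arriving with a clean history.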
I/O Expansion (RIO) Recovery
The RIO interface supports packet retry: if a packet receives no acknowledgment
or a bad response, the interface automatically resends it until a time-out
threshold is reached.
RIO also supports a closed loop topology configuration, which is required for
RS/6000 products. RIO hubs will automatically attempt to reroute packets
through the alternate RIO port if a successful transmission cannot be completed
(for example, the retry threshold is exceeded) through the primary port.
Therefore, no single link failure in the RIO loop will cause the system to go down,
although the failure will be reported for deferred maintenance.
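The retry-then-reroute behavior can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each port is modeled as a callable that returns True when the packet is acknowledged, and the threshold is modeled as a maximum number of attempts rather than a time-out.

```python
def send_with_retry(send, packet, max_attempts):
    """Resend the packet until it is acknowledged or attempts run out."""
    for _ in range(max_attempts):
        if send(packet):
            return True
    return False


def send_over_loop(primary, alternate, packet, max_attempts=3):
    """Try the primary port first, then reroute through the alternate port.

    Returns (delivered, link_failure_reported). A failure of the primary
    path is reported for deferred maintenance even if rerouting succeeds.
    """
    if send_with_retry(primary, packet, max_attempts):
        return True, False
    delivered = send_with_retry(alternate, packet, max_attempts)
    return delivered, True
```

With a primary port that never acknowledges and a working alternate port, `send_over_loop` delivers the packet and still flags the link failure, which mirrors the single-link-failure case described above.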
PCI Bus Error Recovery
As described in the PCI slot section, every slot is connected through a PCI-to-PCI
bridge chip to a primary PCI bus, so each slot is logically and physically
isolated onto its own individual PCI bus. This fact provides a special error
3 Vital Product Data (VPD)
4 Field Replaceable Unit (FRU)
5 Built-in self-test (BIST)
6 Power-on self-test (POST)