Josh-Daniel S. Davis (joshdavis) wrote in eserver,

lspartition -dlpar, DCAPs values


POWER5 cheatsheet of good info and communication paths

This document is a rough outline of the function of POWER5 servers and the HMC, troubleshooting tips, and other related and useful information. This is not a formal document and is not a guarantee of function or support.

HMC Networking
      HMC Network config (including firewall)
      Enabling SSH to the HMC
      SSH keys into the HMC
      RMC port 657
      FSP connection and LPAR access
      Cloned LPARs may have the same ct_node_id
      SSH into the HMC
HMC Communication paths
      Communication from HMC to managed system
      Communication path for CEC power-on
      Communication path for LPAR Activation
Dynamic LPAR
      Dynamic LPAR (DLPAR) internals on POWER5
      Preparation for DLPAR remove from AIX
      Preparation for DLPAR from Linux
      Performing DLPAR remove of a physical device
      Performing DLPAR add of a physical device
      DLPAR common issues
Service Event reporting
Major differences between POWER4 and POWER5
Location Codes
      Mapping location codes
      What is an I/O Processor?
      What is HSL?
      p570 I/O location example
      Integrated Serial
      IDE controller locations
Performing system maintenance
      Enabling the update of system firmware from the Operating System
      Updating system firmware from the HMC
      Backing up your HMC
      Restoring your HMC
      Upgrading your HMC
Service Processor (FSP / ASMI) menu access
      Resetting the Service Processor
      ASMI menu listing
Advanced POWER Virtualization
      Virtual I/O Server
      Virtual SCSI
      Virtual Serial
      Virtual Ethernet
      Virtual ethernet for HMC communication
      Virtual ethernet etherchannel
      Micropartitioning
      Partition Load Manager
Slow discovery of servers
      p5 not seen by HMC
      Manually add the server by IP
      Autoscan can be retriggered
Common problems and troubleshooting
      Recommended Operating Systems for LPARs
      WebSM remote client strange behavior
      LPARs showing 00000000
      Not enough free memory
      Firmware lockup issues
      Root access to the HMC
      Server shows "Password Locked"
      HMC 4.2.1
      Partition Autostart
      Dual HMC
      Restricted commands
Medium level debugging
      Creating the HSCPE user
      PE Debug
      Rebuild Managed System
      No Connection
      Incomplete
      Version Mismatch
Related documentation


HMC Networking

This section is related to HMC network configuration.

      HMC Network config (including firewall)
      Enabling SSH to the HMC
      RMC port 657
      FSP connection and LPAR access
      Cloned LPARs may have the same ct_node_id

HMC Network config (including firewall)

  1. Open your WebSM GUI to the HMC
  2. Expand HMC Maintenance
  3. Choose HMC Configuration
  4. Choose Customize Network
  5. Choose the LAN Adapters tab
  6. Select eth0 and click Details
  7. Verify: Private, DHCP Server Enabled, Pick a smaller subnet.
  8. Inside Firewall, verify FCS Datagram, FCS and VTERM are allowed.
  9. click OK
  10. Select eth1 and click Details
  11. Verify: Public, "Enable Partition Communication", Proper IP and Netmask
  12. Inside Firewall, verify Secure Shell, WebSM, RMC, Web and Secure Web are allowed.
  13. click OK
  14. Select the Routing tab
  15. Default Gateway Device needs to be eth1 and not "ANY"
  16. Select the Identification tab
  17. Hostname needs to NOT be localhost.
  18. Choose OK
  19. Exit the HMC GUI
  20. Choose reboot (or if remote, ssh in and use "hmcshutdown -r -t now")

Enabling SSH to the HMC

  1. Verify the firewall is ok (above)
  2. Open your WebSM GUI to the HMC
  3. Expand HMC Maintenance
  4. Choose HMC Configuration
  5. Choose Enable or Disable Remote Command Execution
  6. Check next to SSH
  7. Reboot the HMC after changing.

SSH keys into the HMC

Use the mkauthkeys command to add SSH keys to the HMC.

RMC port 657

If there is a firewall between the HMC and the LPARs, RMC port 657 needs to be open for both UDP and TCP, in both directions.
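
A quick way to test the TCP half of this from any box with Python is sketched below. This is only a rough check under stated assumptions: the function name and the timeout are made up for illustration, and it says nothing about the UDP side, which cannot be probed this simply.

```python
import socket

RMC_PORT = 657  # RMC port named in the text above


def tcp_port_open(host, port=RMC_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, filtered (timeout), unreachable, or name lookup failure
        return False


# Example call (the hostname is a placeholder, not from this document):
#   tcp_port_open("hmc.example.com")
```

Run it from the LPAR against the HMC and, if you can, from the HMC side of the firewall against the LPAR, since both directions matter.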

FSP connection and LPAR access

The FSP connection is not a network into the LPARs.

Cloned LPARs may have the same ct_node_id

The HMC is basically a cluster manager. /etc/ct_node_id must be unique on all systems in a cluster. All LPARs must therefore be unique.

If this is a new setup, run:

      /usr/sbin/rsct/install/bin/recfgct

This will destroy all RSCT config and make a new node id.
This will clobber GPFS 2.1 and 2.2, HA 5.2, RMC and CSM.

If you need to preserve an existing cluster, call support.
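
If you have already collected the contents of /etc/ct_node_id from each LPAR (how you gather them is up to you), spotting clones is a simple grouping exercise. The sketch below assumes a plain dictionary of LPAR name to node id; the function name and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_node_ids(node_ids: Dict[str, str]) -> Dict[str, List[str]]:
    """Group LPAR names by node id and return only ids used more than once."""
    by_id = defaultdict(list)
    for lpar, node_id in node_ids.items():
        # normalise whitespace and case so copies of the same file compare equal
        by_id[node_id.strip().lower()].append(lpar)
    return {nid: sorted(lpars) for nid, lpars in by_id.items() if len(lpars) > 1}


# Made-up sample data: lpar2 was cloned from lpar1.
sample = {"lpar1": "6f3a9c01d2e4b577", "lpar2": "6f3a9c01d2e4b577", "lpar3": "0b11c2d3e4f5a697"}
duplicates = find_duplicate_node_ids(sample)
```

Any LPAR that shows up in the result needs a new node id before DLPAR will behave.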



HMC communication paths

This is the general sequence of HMC communication paths. See also the DLPAR section.

      Communication from HMC to managed system
      Communication path for CEC power-on
      Communication path for LPAR Activation

Communication from HMC to managed system

  1. Ethernet from HMC to FSP
  2. FSP is service processor with ethernet, serial and jtag
  3. Serial is disabled except in dev
  4. ethernet is used to talk to HMC
  5. jtag is used to work on the managed system itself

Communication path for CEC power-on

  1. IP connection to FSP
  2. FSP pokes firmware image into system RAM
  3. FSP starts CPUs and they begin running.
  4. General FirmWare starts
  5. Hypervisor (pHYP) starts
  6. pHyp builds tables for memory and for resource partitioning

Communication path for LPAR Activation

  1. IP to FSP, FSP to PHYP
  2. Instructs on which resources will be part of this virtual machine
  3. Copy of open firmware is made available to that virtual memory range
  4. CPU is told to run that code
  5. Queries for devices go to PHYP rather than GFW
  6. PHYP brokers and arbitrates what can be seen and performs some virtual memory management

Dynamic LPAR

This section contains various information related to Dynamic Logical Partitioning.

      Dynamic LPAR (DLPAR) internals on POWER5
      Preparation for DLPAR remove from AIX
      Preparation for DLPAR from Linux
      Performing DLPAR remove of a physical device
      Performing DLPAR add of a physical device
      DLPAR common issues

Dynamic LPAR (DLPAR) internals on POWER5

  1. HMC tells FSP to populate values
    "Here is my IP address, my name to you"
  2. rsct.core.rmc includes rmcd and rsct.core.sec includes ctcasd
  3. ctrmc (rmcd) starts and listens/talks on port 657
  4. ctrmc starts ctcas (ctcasd) as necessary to authenticate
  5. ctrmc starts the IBM.*RM (IBM.*RMd) processes as necessary
  6. IBM.CSMAgentRM (part of csm.client fileset) pulls HMC IP from RTAS
  7. LPAR fills IBM.ManagementServer rsct class (like CSM does)
  8. LPAR opens RMC from ephemeral to HMC port 657 tcp
  9. TCP and UDP ephemeral to 657 are used to sync up and stay sync'd
  10. HMC adds the LPAR to IBM.ManagedNode rsct class

At this point, lspartition -dlpar on HMC will show DLPAR status. You must be logged in as hscpe with role hmcpe for this to work. DCaps maps out to hex with these bit values:

         0 - DR CPU capable  (can move CPUs)
         1 - DR MEM capable  (can move memory)
         2 - DR I/O capable  (can move I/O resources)
         3 - DR PCI Bridge   (can move PCI bridges)
         4 - DR Entitlement  (POWER5 can change shared entitlement)
         5 - Multiple DR CPU (AIX 5.3 can move 2+ CPUs at once)

0x3f = max, and 0xf is common for AIX 5.2
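
To save doing the hex in your head, here is a small decoder for the DCaps value. It is a sketch that assumes bit 0 is the least significant bit, which matches 0x3f being the maximum (all six bits) and 0xf covering the first four capabilities. The function name is made up for illustration.

```python
from typing import List

# Bit positions and names as listed above
DCAPS_BITS = {
    0: "DR CPU capable",
    1: "DR MEM capable",
    2: "DR I/O capable",
    3: "DR PCI Bridge",
    4: "DR Entitlement",
    5: "Multiple DR CPU",
}


def decode_dcaps(value: int) -> List[str]:
    """Return the capability names whose bit is set in a DCaps value."""
    return [name for bit, name in DCAPS_BITS.items() if value & (1 << bit)]


# int("0xf", 16) turns the text shown by lspartition into a number first
aix52 = decode_dcaps(int("0xf", 16))
```

For 0xf this gives the CPU, MEM, I/O and PCI Bridge entries, with no entitlement or multi-CPU support.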

Preparation for DLPAR remove from AIX

None of this is needed for CPU or memory.

  1. # lsdev -Cl <device> -F parent
  2. Follow the chain up to the pci bus of the slot
    1. for a 4-port ethernet this is not the first PCI bus you find.
    2. For the CD-ROM on a p5 server, this will be an IDE device
  3. # rmdev -Rl pci#

If this works, you are set. If not, see the Troubleshooting section below.

Preparation for DLPAR from Linux

DLPAR requires Linux kernel 2.6.8 or greater.
IBM's service tools for Linux include DLPAR prerequisites, Service reporting, and the list of supported distributions.
This is still pretty new (2005-03-15).

Performing DLPAR remove of a physical device

If the device was marked as "Required" when the LPAR was activated, then you cannot remove it. Edit the profile, halt, activate again to pick up the new change. "reboot" won't work.

  1. Go into the HMC / WebSM GUI
  2. Server and Partition -> Server Management
  3. Expand the server
  4. Expand Partitions
  5. Right click on the LPAR itself
  6. Choose Dynamic Logical Partitioning
  7. Choose Physical Adapter I/O Resource
  8. Choose Remove
  9. Pick the slot
  10. Choose remove
  11. The HMC sends a command via RMC to the LPAR
  12. rmcd passes this to IBM.DRM (devices.chrp.base.rte)
  13. IBM.DRMd calls drmgr to perform the change
  14. drmgr commands hypervisor to make the change
  15. IBM.DRMd passes the result back to the HMC
  16. The HMC updates its internal representation of the LPAR & System

Performing DLPAR add of a physical device

The LPAR must be up in order to DLPAR resources to it. If it's down, just add them to the profile properties.

  1. Add it from the GUI the same way you removed it above
  2. log into the lpar and use "cfgmgr" to find it
    (not needed for memory/cpu)
  3. On Linux, I believe you use "modprobe \*"

DLPAR common issues

If you get an RMC connection error, then port 657 from lpar to HMC failed to negotiate.
If you get an error from AIX, the device probably wasn't removed.


Service Event reporting

All of the same mechanisms apply here as with DLPAR. The difference is IBM.ServiceRM is used, not IBM.DRM.
IBM.ServiceRM will check errpt for permanent hardware errors.
NOTE: HMC powered down for too long can be considered a perm HW error.


Major differences between POWER4 and POWER5

p4: IBM.CSMAgentRM uses HMC hostname as authentication
p5: IBM.CSMAgentRM uses HMC IP address which is MUCH easier

p4: firmware is called "microcode"
p5: firmware is called "Licensed Internal Code" or LIC

p4: service processor is a vterm to the CEC when powered off
p5: service proc is webserver on FSP IP at any time.

p4: serial connection to CEC from HMC
p5: ethernet connection from CEC to HMC

p4: Service Processor is similar to AS/400 CSP.
p5: Service processor is the same as AS/400. AIX and i5/OS can run on same system.


Location Codes

Location codes are always important, usually more so when you're trying to rebuild your system. This section contains information on how to decode and map location codes. Remember to build your map during install/build rather than trying to guess later.

      Mapping location codes
      What is an I/O Processor?
      What is HSL?
      p570 I/O location example
      Integrated Serial
      IDE controller locations

Mapping location codes

There is no map to AIX location codes from physical location codes. lscfg -vl and lsdev -Cl can be compared for device mapping.

The physical location codes look daunting at first, but make sense in the end. For example:

      U7845.001.3975C3-P1-C5
This would be machine type 7845 (9406 or 9111 type won't be shown).
Then .001 will be the first drawer of it.
p570s and smaller don't have a .002
The next # is the serial number of the drawer to help locate.
This MIGHT not match the sticker on the front (Call HW)
P1 = Planar 1 (most I/O drawers have only 1)
C5 means card slot 5.
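
If you are building your map in a script, the breakdown above can be done with a regular expression. This is a sketch only: the pattern is inferred from the examples in this document rather than from any IBM specification, the function names are made up, and only the P, C and T prefixes explained here are given a meaning.

```python
import re
from typing import Dict, List

# Assumed layout: U<type>.<drawer>.<serial> followed by -<letters><digits> labels
LOCATION_RE = re.compile(
    r"^U(?P<machine_type>[0-9A-Z]{4})"
    r"\.(?P<drawer>[0-9A-Z]{3})"
    r"\.(?P<serial>[0-9A-Z]+)"
    r"(?P<labels>(?:-[A-Z]+[0-9]+)*)$"
)

# Only the prefixes explained in the text; anything else is reported as unknown
LABEL_MEANINGS = {"P": "planar", "C": "card slot", "T": "built-in port"}


def parse_location_code(code: str) -> Dict[str, object]:
    """Split a physical location code into type, drawer, serial and labels."""
    match = LOCATION_RE.match(code.strip())
    if match is None:
        raise ValueError("unrecognised location code: %r" % code)
    labels: List[str] = [part for part in match.group("labels").split("-") if part]
    return {
        "machine_type": match.group("machine_type"),
        "drawer": match.group("drawer"),
        "serial": match.group("serial"),
        "labels": labels,
    }


def describe_label(label: str) -> str:
    """Turn a label such as C5 into 'card slot 5'."""
    prefix = label.rstrip("0123456789")
    number = label[len(prefix):]
    return "%s %s" % (LABEL_MEANINGS.get(prefix, "unknown"), number)


parsed = parse_location_code("U7845.001.3975C3-P1-C5")
```

Feeding it the example above yields machine type 7845, drawer 001, serial 3975C3 and the labels P1 and C5.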

The most common other "slot" types include:
T = Terminal, or built in port. On a p570, T15 is the IDE controller for the DVD-ROM and T12 is an integrated SCSI.
For a p570 with only one CD-ROM, the default location is Bus 3, Slot T15, which is the top Ultrabay 2000 slot in the front of the CEC with the operator panel for multi-drawer systems.
Inside the WebSM GUI, IDE is called "Other Mass Storage Controller".

Each POWER5 CEC has 2 Ultrabay 2000 non-hot-swap ports. If they are not allocated to an LPAR, they can be disabled in ASMI for drive removal. See the section on Service Processor Menus for details.

Infocenter will have more info on location codes under:

  1. http://publib.boulder.ibm.com/eServer
  2. North America
  3. Hardware
  4. Service and support
  5. Service provider information
  6. Resolving problems
  7. Finding Part Locations

See Also the Advanced POWER Virtualization section for information on virtual devices.

What is an I/O Processor?

An I/O adapter connects an I/O device to the I/O processor. Both the I/O adapter and I/O processor work to control the I/O device. Several I/O adapters can belong to an I/O processor.

Only one I/O processor is allowed per system I/O bus. I/O processors process instructions from the system. They offload this work and manage the I/O adapters and devices. Buffering and command queueing are performed.

Most high throughput I/O adapters require the use of an IOP to be attached to an i5/OS LPAR. Commonly, there will be one IOP per bus for an AS/400.

Directly attached I/O resources are under the control of the operating system. The operating system manages the hardware resources without the use of an I/O processor. Storage and LAN adapters can be allocated to a partition directly if integrated to the system.

RPA style LPARs, AIX or Linux based ones, cannot use an I/O Processor. Any bus that contains an I/O processor is owned by that IOP. Move the IOP or use a different bus.
NOTE: The VIO server uses AIX (v2 may use Linux).

What is HSL?

HSL stands for High Speed Loop. HSL is also known as RIO or Remote I/O. HSL-2 is known as RIO-G, RIO-2 or RIO-Plus.

p570 I/O location example

Here is an example listing for a p570 with an AS/400 I/O tower and internal PCI RAID feature 5709.

bus_id drc_name                   Feature Code   description
1      U7879.001.XXXXXXX-P1-T4    null           Universal Serial Bus UHC Spec
1      U7879.001.XXXXXXX-P1-T6    5706           PCI 10/100/1000Mbps Ethernet UTP 2-port
1      U7879.001.XXXXXXX-P1-T14   null           Empty slot
2      U7879.001.XXXXXXX-P1-C3    2844           PCI I/O Processor
2      U7879.001.XXXXXXX-P1-C4    2849           PCI 100/10Mbps Ethernet
3      U7879.001.XXXXXXX-P1-C1    2844           PCI I/O Processor
3      U7879.001.XXXXXXX-P1-C2    2793           PCI 2-Line WAN w/Modem
3      U7879.001.XXXXXXX-P1-T12   5709           PCI RAID Controller
3      U7879.001.XXXXXXX-P1-T15   null           Other Mass Storage Controller
12     U5294.001.YYYYYYY-CB1-C07  2844           PCI I/O Processor
12     U5294.001.YYYYYYY-CB1-C08  2749           PCI Ultra Magnetic Media Controller
12     U5294.001.YYYYYYY-CB1-C09  5702           Storage controller
16     U0588.001.ZZZZZZZ-CB1-C11  2892           PCI Integ xSeries Server
16     U0588.001.ZZZZZZZ-CB1-C12  null           Empty slot
16     U0588.001.ZZZZZZZ-CB1-C13  null           Empty slot
16     U0588.001.ZZZZZZZ-CB1-C14  2844           PCI I/O Processor
17     U0588.001.ZZZZZZZ-CB1-C01  2844           PCI I/O Processor
17     U0588.001.ZZZZZZZ-CB1-C02  2749           PCI Ultra Magnetic Media Controller
17     U0588.001.ZZZZZZZ-CB1-C03  2844           PCI I/O Processor
17     U0588.001.ZZZZZZZ-CB1-C04  2749           PCI Ultra Magnetic Media Controller
18     U0588.001.ZZZZZZZ-CB1-C05  null           Empty slot
18     U0588.001.ZZZZZZZ-CB1-C08  null           Empty slot
18     U0588.001.ZZZZZZZ-CB1-C09  null           Empty slot

NOTE: T12 also contains T13. They are used for the internal drives. Both are SCSI controller ports on the same integrated card.
Without the RAID feature, T12 shows as "Storage Controller".
NOTE: T14 doesn't physically exist anywhere, but is also a Storage Controller at some firmware levels.
U787A.001.XXXXXXX-P4-D2 is p520 DVD, AIX location code 03-08-00

Integrated Serial

The integrated serial ports cannot be used if an HMC is attached. If no HMC, you can use them from full system. Only terminal supported - no heartbeating or data.

IDE controller locations for p5 servers

510 = 
520 = T12 = Other Mass, T10 = mass
550 = Bus 2, Slot T16
570 = Bus 3, Slot T15
575 = 
590/595 = external SCSI 1U drawer - will be a SCSI bus

Performing system maintenance

This section covers the basics of updating, backing up and restoring your HMC and system firmware.

Enabling the update of system firmware from the Operating System

This is required if the Temporary side firmware is corrupt. Otherwise, this is just for user preference.

  1. Open the HMC GUI
  2. Choose Server and Partition
  3. Choose Server Management
  4. Right click on the Managed Server
  5. Choose Properties
  6. Choose the Service Partition
    Current and new partition must already be down
  7. Go back to the top level of the HMC GUI
  8. Choose Service Applications
  9. Choose Service Focal Point (SFP)
  10. Choose Service Utilities
  11. Select the managed server
  12. Pull down Selected
  13. Choose Launch ASM Menu...
  14. Opera, the web browser, will launch and request security authorization
  15. Click OK.
  16. If the page stays blank, close Opera and retry as necessary.
  17. Login as admin/admin or admin/abc1234
    If the password is neither, you may need to reset your service processor
  18. Expand System Configuration
  19. Choose Firmware Update Policy
  20. Set to update from the OS and click OK or Save
  21. Expand Power/Restart Control
  22. Choose Power On/Off System
  23. Make sure it's set to temporary side
  24. If it is not, you will need to shut down your LPARs prior to making this change.
  25. Activate the Service Partition if it's not running
  26. Use normal OS based system firmware update utilities.

RPM or Self Extracting Archive may be downloaded from IBM Fix Central under Microcode Downloads to get the image.image.

Depending on your operating system and preference, you may use any of DST, SST, update_flash, shutdown -u, or diags to update

NOTE: If you are using an HMC, please make sure your HMC is current code level prior to updating system firmware.

Updating system firmware from the HMC

Please make sure your HMC is current code level prior to updating system firmware.

  1. Open the WebSM GUI
  2. Expand Licensed Internal Code
  3. Choose Licensed Internal Code Update
  4. Use Update LIC to update your code
  5. Choose your microcode repository from the list (internet, CD, etc)
  6. Follow the prompts.

If going from SF220 to SF222, you may need to use "Material Equipment Specification" rather than "Update LIC".
The microcode CD is available from Fix Central under Microcode Downloads.

Backing up your HMC

Any time you make changes to your HMC, you should back it up.

        Profile Data Backup
        Back Up Critical Console Data

Profile Data Backup

Do this for each server. Backup filenames do not collide between servers

  1. Open the WebSM GUI
  2. Expand Server and Partition
  3. Choose Server Management
  4. Right click on the server
  5. Choose "Profile Data"
  6. Choose "backup"

Backup Critical Console Data

Use this to back up all changed files from your HMC to DVD-RAM or an FTP Server. The DVD backup is a tgz file on a UDF filesystem with absolute paths.

Your backup media must be DVD-RAM, not DVD-RW, DVD+R, or MO. You will see hard sector markings on the data side of the media. Your system came with one blank with a white label, black print.

You must pre-format the DVD-RAM if not already done. If this is a never before used DVD-RAM, do the following:

  1. Format removable media
  2. Put in your disc and say OK

To perform the actual backup:

  1. Open the WebSM GUI
  2. Go to Licensed Internal Code
  3. Go to HMC Licensed Internal Code
  4. Choose "Backup Critical Console Data"
  5. Choose to send this to an FTP server or a DVD-RAM

Restoring your HMC

Restoring your HMC is the same as installing it for all but the last step. The main thing to consider is that if your backup was at HMC 4.3, then you cannot install with HMC 4.4 media and expect the 4.3 backup to work. The HMC won't know the difference and will leave the system in an inconsistent state.

So remember, the recovery media must be the same version and release as the CCD, and at or below the maintenance level of the CCD.
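
The rule above is easy to get backwards, so here it is as a check. This is a sketch that assumes a level can be written as three numbers (version, release, maintenance), which is my own simplification; the function name is made up.

```python
from typing import Tuple

Level = Tuple[int, int, int]  # (version, release, maintenance), an assumed layout


def media_can_restore(media: Level, ccd: Level) -> bool:
    """True if recovery media at level `media` may be used with a CCD backup at level `ccd`."""
    same_version_and_release = media[:2] == ccd[:2]
    at_or_below_maintenance = media[2] <= ccd[2]
    return same_version_and_release and at_or_below_maintenance


# 4.3 backup with 4.4 media: the case the text warns about
bad_combo = media_can_restore((4, 4, 0), (4, 3, 1))
```

Media at 4.3.0 with a 4.3.1 backup passes; media at 4.4.0 or at 4.3.2 with the same backup does not.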

        Initial Reinstall
        Recovery of Changes

Initial reinstall

  1. Power down your HMC
  2. Power up your HMC
  3. Choose [F1] to enter BIOS
  4. Choose System Startup Options
  5. Choose System Boot Sequence
    This is hard to see. It is at the top and looks like a heading.
  6. Make sure the CD-ROM is listed before hdisk.
  7. Place disc 1 of the HMC Recovery Media in the drive.
  8. Exit BIOS saving changes
  9. The installer will load
  10. Choose Install (vs upgrade)
  11. Wait for it to finish rebuilding the basic system image.
  12. Wait for it to reboot
  13. When it prompts you for additional RPM discs, put in disc 2 (Additional RPMs)
  14. When it prompts you for additional RPM discs, put in disc 3 (Additional RPMs)
  15. When it prompts you for additional RPM discs, put in disc 4 (InfoCenter)
  16. When it prompts you for additional RPM discs, you can either install non-us Infocenter, or choose Finish

Recovery of Changes

At the end of the install, the HMC will prompt you for a Critical Console Backup.

  1. When it prompts you for a Critical Console Backup, insert your DVD-RAM disc if you have one.
  2. When all is complete, the system will boot.
  3. Log in to the HMC and verify your settings or reconfigure as applicable.

Upgrading your HMC

In summary, here's how to update your HMC between major releases.

        Warning
        Recovery Media method
        Install Corrective Service method

Warning

Prior to performing any of these steps, you should first make sure your HMC is backed up. If your system becomes unstable or you make a mistake, you will need the backup disc and your Recovery Media to restore your HMC.
Also, make sure to read through these procedures FIRST, as they make reference to other sections, then offer adjustments or corrections.

Recovery Media method

We start by storing the HMC data onto a nonvolatile portion of the disk. If your HMC has excessive core files in /home/hscroot, remove these first. If your HMC fails this process, service agent may need to be started, or you may have other bulk. Contact support.

  1. Open the WebSM GUI
  2. Choose Licensed Internal Code
  3. HMC Licensed Internal Code
  4. Save Upgrade Data

Next, perform an Initial Install of your HMC. NOTE: There are 2 differences.

  1. Choose "Upgrade" rather than "Install"
  2. Do NOT recover your critical console data backup.
If your first attempt to boot from the recovery media fails, then you will need to make a new "Save Upgrade Data" backup.

The Save Upgrade Data backup will be automagically restored when the HMC boots.

Install Corrective Service method

  1. Open the WebSM GUI
  2. Choose Licensed Internal Code
  3. HMC Licensed Internal Code
  4. Install Corrective Service
  5. Choose "Removable Media" for update CDs,
    or choose FTP to update from the Internet.
  6. If updating from CD, put in disc 1.
    If updating from FTP, put in the location of your first zipfile.
  7. When it is complete, DO NOT REBOOT
  8. Repeat the above steps for the other CDs or ZIP files provided.
  9. When ALL related update media are installed, exit the HMC GUI.
  10. When prompted, choose "Restart" or "Reboot".

Service Processor (FSP / ASMI) menus

To access ASMI manually from another system, point your browser to https://fsp_ip_address. The HMC2 port on the FSP is commonly used for this.

When an FSP port is not given a DHCP address at FSP boot, it falls back to its default IP: 192.168.2.147 for HMC1 and 192.168.3.147 for HMC2. Otherwise, the IP of the HMC1 port is as defined in the HMC.

From the HMC command line,

      $ lssyscfg -r sys -F name,ipaddr

will show the address. If this is inaccurate or out of date,

      $ lshmc -n

will show currently assigned DHCP client addresses. You can use ping to see which ones are alive.

If you don't have access to the FSP network, you can access ASMI from the HMC. As of the time of this writing, 2005-03-15, this still requires you to be local to the HMC. Plans were underway for an HTTP proxy to allow this from the remote client, possibly in 4.4.x.

  1. Open the WebSM GUI
  2. Choose Service Applications
  3. Choose Service Focal Point (SFP)
  4. Choose Service Utilities
  5. Select the managed server
  6. Pull down "Selected"
  7. Choose "Launch ASM Menu..."
  8. If Opera (web browser) comes up blank page, close and retry.
  9. Log in as admin/admin, admin/abc1234, or whatever it's been changed to.

If you have lost the password to ASMI, you will need to reset the service processor.

Resetting the service processor

If you plugged in power to the p5 server before the HMC was configured, your FSP is probably on the default IP and no longer looking for a dynamic IP on a subnet the HMC can see.

  • FSP defaults to 192.168.2.147 on HMC1
  • FSP defaults to 192.168.3.147 on HMC2
  • Must unplug p5 to reset.
  • Sometimes must pull FSP and toggle dip switches
  • FSP reset is documented in InfoCenter with pictures
  • FSP reset can be performed from ASMI also.

ASMI menu listing

If you are in, here is a list of the menus (for reference). Of course, each option has its own page, which is not included here.
   Collapse all menus             
   Expand all menus               
    Power/Restart Control      
        Power On/Off System    
        Auto Power Restart     
        Immediate Power Off    
        System Reboot          
        Wake On LAN            
    System Service Aids        
        Error/Event Logs       
        Serial Port Snoop      
        System Dump            
        Service Processor Dump 
        Serial Port Setup               
        Modem Configuration             
        Call-Home/Call-In Setup         
        Call-Home Test                  
        Reset Service Processor         
        Factory Configuration           
    System Information                  
        Vital Product Data              
        Power Control Network Trace     
        Previous Boot Progress Indicator
        Progress Indicator History      
    System Configuration                
        System Name                     
        Processing Unit Identifier      
        Configure I/O Enclosures        
        Time Of Day                     
        Firmware Update Policy         
        PCI Error Injection Policy     
        Hardware Deconfiguration       
            Deconfiguration Policies   
            Processor Deconfiguration  
            Memory Deconfiguration     
        Program Vital Product Data     
            System Brand               
            System Keywords            
            System Enclosures          
        Service Indicators             
            System Attention Indicator 
            Enclosure Indicators       
            Indicators by Location code
            Lamp Test                  
    Network Services                   
        Network Configuration    
        Network Access           
    Performance Setup            
        Logical Memory Block Size
    On Demand Utilities          
        CoD Order Information    
        CoD Activation           
        CoD Recovery             
        CoD Command              
    Concurrent Maintenance       
        Control Panel            
        IDE Device Control       
    Login Profile                
        Change Password          
        Retrieve Login Audits    
        Change Default Language

Advanced POWER Virtualization

APV is a general name for the following features:

APV is provided by the following feature codes:

   Server				Feature Code	Included by default?
   IBM eServer p5 9111 Model 520	7940		No
   IBM eServer p5 9113 Model 550	7941		No
   IBM eServer p5 9117 Model 570	7942		No
   IBM eServer p5 9119 Model 590	7992		Yes
   IBM eServer p5 9119 Model 595	7992		Yes
   IBM OpenPower 720			1965		No
   IBM eServer p5 9118 Model 575	TBD		TBD
   
If this was included in your order, visit the On Demand registration site to retrieve your code.

If your code is missing from this site,
Verify you can find the feature above on your order.
If it's missing, call your salesperson.
If it is there, check to see if you bought from IBM directly.
If you did not, contact your business partner.
If you did, contact the IBM Rochester Quality Hotline. This should be included in your paperwork.

Rochester Quality Hotline Number is 800-426-4356
   5 bad order
      1 iSeries
      2 storage systems
      3 pSeries general
      4 7022/7023
      5 storage networking

Virtual I/O Server

Information and patches relating to the VIO server can be found at the VIO Server website.

This is a special O/S build, a black-box partition. Currently it is AIX and is installed via a mksysb CD. Future versions may be Linux.
As of 2005/03/15, Virtual I/O server PTF 4 is HIGHLY recommended.

Virtual I/O requires AIX 5.3. ML1 is HIGHLY recommended.
Support for VIO is not being backported to AIX 5.2
Original -00 base CDs do not work.
Original -02 base CDs have problems.
Original -01 base CDs work OK.

There is support for multiple I/O servers on the same managed system.

Virtual SCSI

Clients can use MPIO or LVM mirroring. The VIO server supports EMC powerpath, SDD vpaths, Hitachi HDLM, MPIO, direct SCSI, fibre and SSA.

LVM mirroring on the VIO server should generally not be used. Mirror retry is not automatic, and Client AIX has no way of knowing if a disk error is from a mirrored backing device. A fault in one mirror would cause every other I/O to fail.

It is best to mirror across two separate VIO servers.

Virtual SCSI handles DMA directly and commands virtually via:
SRP - SCSI Remote DMA Protocol
LRDMA - Logical Redirected DMA

It is actually recommended to run at least 2 VIO servers, so that virtual VGs can be mirrored across 2 "controllers". This way, if one VIO server needs patches, or hangs, the client LPARs can stay running.

When building LPAR profiles, you absolutely MUST make a chart, on paper, by hand. If you don't do this, you will lose track of the crossconnects. There is no wizard to do this for you. The redbook "Advanced POWER Virtualization on IBM eServer p5 Servers: Introduction and Basic Configuration" is a good reference.

Virtual serial

While you have 2 by default, and can add more, only the first one is used. It's not known if there will be a future for sharing an async card.

Virtual ethernet

All clients on the same port VLAN ID can talk to each other. AIX VLAN adapters can be built, but this is rarely used. NIM can be used to install over this if you have a NIM server on the same virtual ethernet VLAN.

If using multiple VLAN IDs, then use "Additional VLAN IDs" and create AIX VLAN adapters. veth (ent0) can have additional (VID) ethernets (eth) for different vlans, and then enX off of the virtual adapters is used for TCPIP. AIX cannot be installed over an AIX VLAN adapter, but can be installed over the Virtual ethernet using the port vlan ID's network.

Trunk Adapter setting is ONLY for the I/O server and effectively means all traffic for that PVLANID will go to that VIO server's adapter. Only one VE trunk per VLAN in initial release.

If using one VLAN ID, use Port VLAN ID. This configures normally from the LPAR as an ethernet card.

When building virtual ethernet adapters, the physical location code will be in the form of *-Vx-Cy. x is the LPAR ID number and y is the virtual I/O slot.
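
When you are charting crossconnects, pulling the LPAR ID and slot back out of a virtual location code can be scripted. This sketch assumes the -Vx-Cy part appears exactly as described above; the function name and the sample code string are made up for illustration.

```python
import re
from typing import Tuple

# -V<lpar id>-C<virtual slot>, optionally followed by further labels
VIRTUAL_SLOT_RE = re.compile(r"-V(?P<lpar_id>\d+)-C(?P<slot>\d+)(?:-|$)")


def parse_virtual_slot(code: str) -> Tuple[int, int]:
    """Return (lpar_id, virtual_slot) from a virtual adapter location code."""
    match = VIRTUAL_SLOT_RE.search(code)
    if match is None:
        raise ValueError("no -Vx-Cy part in %r" % code)
    return int(match.group("lpar_id")), int(match.group("slot"))


# Hypothetical code for LPAR 3, virtual slot 2
lpar_id, slot = parse_virtual_slot("U9117.570.XXXXXXX-V3-C2")
```

Physical codes such as the -P1-C5 style ones raise ValueError, which is a handy way to separate the two kinds in a mixed listing.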

The Shared ethernet device can be used for I/O server connectivity as well as client LPAR access.

  • HMC creates MAC addresses based on serial number and LPAR ID
  • 20 VLANs (VID) per adapter
  • max 256 virtual ethernet adapters per AIX LPAR
  • 65394 max MTU support

Virtual ethernet for HMC communication

If you want to do DLPAR or talk to any other system (e.g., the HMC), you need to be able to ping outside of the virtual ethernet.

You will need a Linux LPAR or a VIO server for bridging. This will need to have a virtual ethernet card with the same PVLAN ID and "Trunk Adapter" set. This will also need a physical ethernet adapter. In the VIO server, you build a Shared Ethernet (see the redbook). In Linux, you will create a bridge (see BRIDGING-HOWTO).

It is possible to use AIX, Linux or i5/OS as a router. This is less CPU load than bridging, and doesn't require trunk settings. This will require corporate network routes to point to the p5 server.

No inter-switch protocol support provided by the AIX Shared Ethernet Adapter (bridge).

Virtual ethernet etherchannel

Physical access can be single or aggregate adapters. Some problems still exist with multiple adapters.

Virtual I/O server etherchannel is not happy yet (1.1.4 & SF222_075). Packets get discarded when exiting the frame. Packets get discarded for any LPAR not on the same VLANs.

Micropartitioning

Micropartitioning means fractional CPU. The limits are:
.10 minimal
.01 increments
Max 1 virtual proc per proc (or SMT)
Max 64 real (or 128 SMT) procs per LPAR
254 virtual processors per CEC
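
The first two limits can be checked mechanically. This is a minimal Python sketch, not anything the HMC exposes; the function name is made up, and it only covers the .10 minimum and .01 increment from the list above.

```python
from decimal import Decimal

MIN_UNITS = Decimal("0.10")   # smallest entitlement from the list above
STEP = Decimal("0.01")        # entitlement granularity from the list above

def valid_entitlement(units):
    """Check a processing-unit value against the .10 minimum and .01 increment."""
    value = Decimal(str(units))
    if value < MIN_UNITS:
        return False
    # A valid value is a whole number of 0.01 steps.
    return value % STEP == 0

for candidate in ("0.10", "0.05", "0.255", "1.37"):
    print(candidate, valid_entitlement(candidate))
```

Decimal is used instead of float so that values like 0.10 and 0.01 compare exactly.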

Entitled Capacity defines the fractional CPU entitlement for an LPAR. In the profile, the LPAR is specified to have:

  • Amount of processing units in % of CPU
  • desired, min, max number of CPUs
  • Capped: entitled capacity is maximum allowed
  • Uncapped: entitled capacity is minimum allowed
    ceded and idle capacity may be used
  • variable capacity weights (0-255) for uncapped
    0 = soft cap to prevent exceeding entitled capacity
    otherwise, it's comparative against other weighted LPARs
Changes to entitled capacity occur on the fly.
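
To make "comparative against other weighted LPARs" concrete, here is a small Python sketch that splits spare capacity in proportion to weight. The proportional split is my assumption about how the comparison works, not a description of the actual hypervisor algorithm. The function name, LPAR names and numbers are hypothetical.

```python
def share_spare_capacity(spare_units, weights):
    """Split spare processing units among uncapped LPARs in proportion to weight.

    weights maps LPAR name -> weight (0-255). Weight 0 receives nothing,
    which models the "soft cap" case described above.
    """
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be 0-255, got {weight}")
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: spare_units * weight / total for name, weight in weights.items()}

# Hypothetical LPAR names and weights.
print(share_spare_capacity(2.0, {"prod": 192, "test": 64, "batch": 0}))
```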

The AIX kernel in 5.3 was modified to reduce PHYP context switching and to disable spin locks when using fractional CPU. Idle time is ceded to the free pool by the wait process. The wait process receives notice of wake-up and continues as necessary.

Dynamic reconfiguration can be used for virtual cpus, entitlement, and variable weight.

smtctl -m on/off can be used to enable or disable Simultaneous Multithreading (SMT). bindprocessor shows primary and secondary bind IDs for the SMT CPUs.

HPC should use dedicated and no SMT. SMT can pose a problem for virtual ethernet if enabled and fractional CPUs are moved.

Performance tools have been enhanced to show details about micropartitions:

  • lparstat shows % entitled, % logical, % shared pool idle, and hypervisor stats.
  • lpar_get_info() reports theoretical latency, min, max, guarantee
  • curt, splat, filemon, pprof, netpmon
  • trace data now exists for vproc preemption
  • the hope is that ISVs use the API to determine the number of physical CPUs for licensing

This is a HW/microcode feature and is not tied to the O/S. The AIX change is exploitation only, not enablement.

Poorly behaved applications may tie to physical CPU (Supervisor mode), or may spinlock and eat free cpu.

Partition Load Manager

PLM can manage DLPAR resources on the fly between multiple LPARs based on user defined policy. Can be used for both CPU and memory. Not very useful for uncapped processors.

PLM can be used with shared and dedicated CPUs and supports AIX 5.2 ML4 (5.2H) and AIX 5.3.

PLM server should be a separate system, but can be an LPAR too. 1 PLM instance can manage an entire p5 server. Multiple PLM instances may be run on the same PLM server.

Configuration is via WebSM applet or flat files. If no PLM policy file, uses values from active profile on HMC.

Provides realtime load and allocation statistics. Notifications are:
   red = requesting
   green = donating
   yellow = happy and no notification.
Monitor mode allows testing without actual resource move.

PLM is RSCT based and relies on name resolution. RSH is used to set up PLM for RSCT communication. SSH to HMC is used to perform DLPAR against nodes. IBM.DRM sends event notification to PLM server.

Amounts moved are minimally 10%. If only 9% left in range, no move will occur.

PLM pulls from the free pool first if available. If not, then it will pull from the lowest load LPAR. If CPUs are idle, they are ceded back to the free pool. Defaults help to prevent thrashing. PLM changes take several seconds to several minutes to perform.
It is possible to reserve resources outside of a managed group by choosing no pool ID in the LPAR profile.
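
The donor order and the 10% minimum described above can be sketched in a few lines of Python. This is an illustration only, not PLM's real logic. The function names, LPAR names and load figures are made up, and reading "left in range" as the percentage of headroom remaining is my assumption.

```python
MIN_MOVE_PCT = 10  # smallest move size quoted above

def pick_donor(free_pool_units, lpar_loads):
    """Pick where capacity comes from: the free pool first, else the least-loaded LPAR.

    lpar_loads maps LPAR name -> current load (any comparable number).
    """
    if free_pool_units > 0:
        return "free pool"
    if not lpar_loads:
        return None
    return min(lpar_loads, key=lpar_loads.get)

def move_allowed(headroom_pct):
    """A move only happens if at least MIN_MOVE_PCT of range is left (assumed reading)."""
    return headroom_pct >= MIN_MOVE_PCT

# Hypothetical LPAR names and load figures.
print(pick_donor(0.5, {"lpar1": 0.9, "lpar2": 0.2}))
print(pick_donor(0, {"lpar1": 0.9, "lpar2": 0.2}))
print(move_allowed(9), move_allowed(10))
```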

Weight is how many cpus over "desired" the CPU should have. Virtual processors can use Capacity and # of virtual processors. Smaller number of virtual processors is more efficient. This is because all vprocs have same % within the same LPAR. Large number of minimum-use virtual processors will spend more time context switching.


Slow discovery of servers

The HMC network scan takes some time. In general, don't be alarmed if it takes 2 hours for a new server to show up.

To speed this up, you may choose to use a class C network (x.x.x.2 -> x.x.x.254 in the HMC).
You may also choose to add the server manually rather than wait.

p5 not seen by HMC

A common problem is that the p5 server is plugged in prior to the HMC DHCP server being set up properly.
Another can be slow scan of the network, or the HMC has lost track of the FSP when it changed IPs.

Manually add the server by IP

  1. $ lshmc -n from command line
  2. "clients=" at the end are DHCP addresses
  3. ping these to see which are alive
  4. Open the WebSM Gui
  5. Expand Server and Partitions
  6. Choose Server Management
  7. Pull down "Selected"
  8. Choose Add
  9. Fill in the IP address from above

Autoscan can be retriggered

$ mksysconn -o auto


Common problems and troubleshooting

Recommended Operating Systems for LPARs

  • AIX 5.3 ML1
  • Virtual I/O Server v1, PTF 4
  • SuSE Enterprise Linux 9, 2.6.8 or better kernel.
  • i5/OS (OS/400) 5.3 (unknown patches required for stability)

WebSM remote client strange behavior

Using old WebSM code will lead to strange behaviour remotely. If you find that your function works fine from the HMC itself, but not remotely, perform these actions:

  1. Uninstall WebSM remote client
  2. Clean up C:\Program Files\websm or wherever you have/had it.
  3. Remove WSMDIR environment variable from system Control Panel or from /etc/environment.
  4. Visit http://hmcname/remote_client.html
  5. Download the installshield version if you don't have Java 1.4.2_03
  6. You can use the Java Webstart version if your JVM is 1.4.2_03
  7. Follow the installation prompts
  8. Try using WebSM again.

LPARs showing 00000000

LPARs showing 00000000 for the Operator Panel value are actually down, but PHYP hasn't destroyed the LPAR. If you add the resources to another LPAR & activate, they're freed. If the request is for memory or CPU, the system will try to use other memory first before dipping into a recently deactivated LPAR.

Not enough free memory

Memory can become fragmented because of the memory allocation rules as above. You may find that an LPAR fails to activate because there is not enough contiguous memory. If this happens:

  1. Verify memory isn't lost:
    $ rsthwres -r mem -m
  2. If still can't start, continue.
  3. Shut down all LPARs
  4. Activate the "Use All Resources" LPAR.
    One came with the system as the serial number, but might have been removed or renamed.
  5. Wait for a state of "Starting"
  6. Perform an Immediate Shutdown of the LPAR
  7. Start your LPARs individually

Firmware lock-up issues

As of 2005/03/14, there are still many firmware lock issues requiring power cycle of a p5 server.

Many of these are alleviated by upgrading to HMC 4.4 and system firmware SF222_081.

It is strongly recommended to upgrade to HMC 4.4.1 and system firmware SF225_* once they become available. These should be available sometime in April.

Root Access to HMC

If needed, support can get in as root. 99.9999% of what you need to do can be done non-root. Passwords are 1 day hash and are support directed.

If you think you need root, chances are we have a non-root solution for you.

No custom software may be installed.
Yes, it's SuSE 8.1 (as of 3.3.4 and 4.4.0).
No, you aren't allowed to export X sessions.

Server shows "Password Locked"

If adding the server failed and is in "Password Locked"

  1. log into ASMI as admin
  2. Select User Management
  3. change the password for the user "HMC"

HMC 4.2.1

HMC 4.2.1 should generally be avoided. 4.2.1 base from the factory often could not see managed servers because hdwr_svr would not autostart. Also, it did not have EXT3 filesystem and could corrupt filesystems on hard power off.
It is recommended to use 4.3.2 or later via upgrade install or reinstall.

Partition Autostart

Partition Autostart should only be used after a p5 server is properly lparred. This is great for full system image, or for a system whose LPARs won't change often.

Problems include that older firmware wouldn't free resources after stopping some of the LPARs.

Use Partition Standby for CEC power on mode under most situations.

Dual HMC

Limitations for Dual HMC include:

  • For Dual CR2 HMC to work, the HMCs must be 4.2.1 plus patches, or later.
  • May not have 2 HMCs on the same private network.
    DHCP servers will conflict
  • May not have HMC1 port and HMC2 port of the same FSP on the same subnet.
    Every other packet lost - HMC shows "incomplete"
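
A quick way to sanity-check the last rule before cabling is to compare the two planned FSP port addresses. This is a minimal Python sketch using the standard ipaddress module; the function name and the addresses are examples only, not taken from any real configuration.

```python
import ipaddress

def same_subnet(addr_a, addr_b, netmask):
    """True if both addresses fall in the same network for the given netmask."""
    net_a = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{addr_b}/{netmask}", strict=False)
    return net_a == net_b

# Example addresses for the HMC1 and HMC2 ports of one FSP.
print(same_subnet("192.168.2.147", "192.168.2.148", "255.255.255.0"))  # conflict
print(same_subnet("192.168.2.147", "192.168.3.147", "255.255.255.0"))  # separate subnets
```

A True result is the configuration to avoid.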

Restricted commands

Use man cmd from HMC for help on most commands.
Use ls /usr/hmcrbin to see what all you can run as hscroot
Use ls /hmcrbin to see what more you can run as hscpe


Medium level debugging

Use this section if power cycling the CEC and then the HMC doesn't help. This is more info for SF222_081 and earlier. Some root access may be required. This assumes a working knowledge of the "pedbg" command on the HMC and general HMC familiarity.

      Creating the hscpe user
      PE Debug
      Rebuild Managed System
      No Connection
      Incomplete
      Version Mismatch

Creating the hscpe user

To create the hscpe user:

$ mkhmcusr -u hscpe -a hmcpe -d "IBM Service"
For the password, use 7 or more characters, alpha-numeric only

PE Debug

Use "pedbg" as user hscpe to gather debugging information. Almost anything indicated below will be gathered by pedbg into a zipfile.

pedbg - run as hscpe; use -d on and -d off for tracing
      (some things require a reboot to trace, though)
   -c  collect trace to /tmp or DVD
   -r  removes the file
   -l  subsystem name
Use the man page and IBM Support for more details.

Rebuild Managed System

Rebuild Managed System forces comm to restart on HMC and FSP. If this doesn't work:

      llcmd localhost 9734
If no response, hdwr_svr can't communicate with the FSP.
Get pedbg information for cimserver.log.
Reboot to fix.

No Connection

Verify environment settings
Is this dhcp or static?
If dhcp do you see the ipaddresses the dhcp server(HMC) has given?

         lshmc -n  (look for clientip=)
         tail /var/lib/dhcp/dhcpd.leases
         tail /opt/hsc/data/HmcNetConfig
If static what's the ipaddress the admin assigned to the FSP?

Verify network connectivity
Can you ping the fsp?
What is the subnet on the network on the HMC compared to the FSP settings?
Use a different subnet for a larger range, 255.255.0.0 on the HMC if unable to connect through ASM
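
The effect of widening the netmask can be checked offline. This Python sketch uses the standard ipaddress module to test whether an FSP address falls inside the HMC's network for a given mask. The function name and the addresses are hypothetical.

```python
import ipaddress

def fsp_in_hmc_network(hmc_ip, fsp_ip, netmask):
    """True if fsp_ip lies inside the network defined by the HMC address and netmask."""
    hmc_net = ipaddress.ip_interface(f"{hmc_ip}/{netmask}").network
    return ipaddress.ip_address(fsp_ip) in hmc_net

# Hypothetical addresses: the FSP only becomes reachable with the wider mask.
for mask in ("255.255.255.0", "255.255.0.0"):
    print(mask, fsp_in_hmc_network("10.0.5.10", "10.0.9.20", mask))
```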

What is the duplex/data rate?
If the HMC is set to auto/auto or 1000/full, try 100/full on the HMC and retry.

Verify cabling/hub and switch setup.
Do direct connections work?
Does the other HMC port on the FSP work?

Check for ASM access to the FSP.
Can you get into the FSP either from Service Utilities on the HMC or directly connected with a laptop?

hdwr_svr is constantly polling between the HMC and CEC. Look for SSL errors in the hdwr_svr trace.

If FSP is utterly dead, contact 800-IBM-SERV for hardware support.

Recovery

The HMC runs checksum to the file in the FSP save area and compares to backup file on HMC.

The CEC is marked "Recovery" if the save area checksum is null, indicating the profile data is no longer part of the NVRAM.

Use "Recover Partition Data" on the CEC. If a profile data backup or a good backupFile exists, then profiles can be restored. If that fails, use Initialize and rebuild LPARs by hand.

ALWAYS KEEP CURRENT PROFILE DATA BACKUPS!

Incomplete

This means the HMC either:

  • sent a command without a response, AKA network connectivity problems.
  • sent a command and got data, but the FSP is waiting on PHYP and times out.
Timeout is 60 seconds. PHYP may respond after several minutes.

If it doesn't work after 5 minutes, re-run Rebuild Managed System. If still stuck, get pedbg output for the cimserver and hdwr_svr traces.

If a power cycle of the CEC does not work, you may need to have hardware support take an FSP dump (this takes hours).

Power off of the managed system and power on will clear a mailbox hang condition.

CMVC defect 490077 and 486129
May be fixed in GA4 PHYP 2.2.m5 (sf225_XXX)
cimserver.log shows:

               invoke error, status is 1012  
               Error detected. Method  to return incomplete state   
               set phypRunning false
hdwr_svr.log shows:
               WARNING: BE AWARE::Bad return code in the response
               package? (0x0006)

Upgrade of HMC to 4.4.0 for some communication problems

CMVC defect 441426 HMC managed system state incomplete when fsp and phyp seem to be communicating.

            Thread[Thread-43,5,main] invoke error, status is 1009                                                                      
            Thread[Thread-30,5,main] already got resp no waiting                                                                   
            Thread[Thread-30,5,main] t.status = 1009          
            Thread[Thread-30,5,main] Cmd did NOT time out!

Version Mismatch

Verify the output from commands listed shows correctly:
  • lshmc -n
  • lssysconn -r all
  • lssyscfg -r sys (fsp data)
  • lssyscfg -r frame (bpc data)
  • tail /opt/hsc/data/HmcNetConfig
  • tail /var/lib/dhcp/dhcpd.leases
  • Verify all ipaddresses are pingable

Some solutions have been:

  • CE re-seats the VPD (anchor) card
  • Toggle FSP dip switches
    Details in infocenter under adding a console, hmc
  • On 9119
    1. Hit UEPO switch for power off
    2. rmsysconn all connections
    3. mksysconn -o auto
    4. hmcshutdown -t 0 -r
    5. wait for HMC to come up
    6. Hit UEPO switch for power on.

For everything else:
Capture cimserver.log looking for corrupted or incorrect MTMS for system showing version mismatch.


Related Information

      InfoCenter (docs)
      All HMC info
      IBM Fix Central
      Microcode Downloads
      Support Guide
      Problem Management Online
      VIO Server Resources
      Redbook Abstract for Advanced POWER Virtualization on IBM eServer p5 Servers: Introduction and Basic Configuration
      Linux HOW-TOs

Call 800-CALL-AIX for help





UPDATE 2015-04-01

Modern DLPAR is a little different
* Instead of IBM.ManagedNode, it's now IBM.MngNode
* Instead of IBM.ManagementServer on the clients, it's now IBM.MCP
* lspartition -dlpar is looking for 0x14f9f on the dcaps.

I've run into several clean VIO installs, with RMC wide open and no firewalling, that simply do not connect.
A reset looks like this now:
root@vios2  /
# telnet hmc01 rmc
Trying...
Connected to hmc01.
Escape character is '^]'.
quit

telnet> quit
Connection closed.

root@vios2  /
# rmcctrl -z
/usr/sbin/rsct/install/bin/recfgct

root@vios2  /
# /usr/sbin/rsct/install/bin/recfgct
/usr/lib/dr/scripts/all/ctrmc_MDdr               DR script to refresh Management Domain configuration
0513-071 The ctcas Subsystem has been added.
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 11010218.

root@vios2  /
# rmcctrl -p

root@vios2  /
# lsrsrc "IBM.MCP"
Resource Persistent Attributes for IBM.MCP
resource 1:
        MNName            = "VIO's IP Address here"
        NodeID            = 10505315169999999999
        KeyToken          = "hmc01"
        IPAddresses       = {"HMC's IP Address Here"}
        ConnectivityNames = {"VIO's IP Address here"}
        HMCName           = "7042CR8*21FFFFF"
        HMCIPAddr         = "HMC's Public IP Address Here"
        HMCAddIPs         = "HMC's private IP addresses Here"
        HMCAddIPv6s       = ""
        ActivePeerDomain  = ""
        NodeNameList      = {"vios2"}

root@vios2  /
#  /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
Management Domain Status: Management Control Points
  I A  0x91deadbeefcafe79  0001  HMC's IP Address Here

Then from the HMC, after a minute or two:
hscroot@hmc01:~> lspartition -dlpar
<#0> Partition:<2*8286-42A*21FFFFF, , 172.22.94.103>
       Active:<1>, OS:<AIX, 6.1, 6100-09-04-1441>, DCaps:<0x14f9f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<1247>
<#1> Partition:<1*8286-42A*21FFFFF, , 172.22.94.102>
       Active:<1>, OS:<AIX, 6.1, 6100-09-04-1441>, DCaps:<0x14f9f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<1098>

It takes about 5 minutes for DLPAR to come alive.
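
If you check this often, the output is easy to scan with a script. This is a rough Python sketch that parses text in the shape shown above and flags partitions whose DCaps is not 0x14f9f. The function name is made up. The first sample record is copied from the output above; the second is an illustrative record for a partition that has not connected. Other RSCT or OS levels may report different DCaps values, so treat the expected value as an assumption taken from this post.

```python
import re

EXPECTED_DCAPS = 0x14F9F  # value the update above says to look for

# One record: the Partition:<...> line followed by its Active/DCaps line.
_RECORD_RE = re.compile(
    r"Partition:<(?P<part>[^>]*)>.*?Active:<(?P<active>\d+)>"
    r".*?DCaps:<(?P<dcaps>0x[0-9a-fA-F]+)>",
    re.DOTALL,
)

def dlpar_status(output):
    """Return (partition, ip, active, dcaps_ok) tuples from lspartition -dlpar text."""
    results = []
    for m in _RECORD_RE.finditer(output):
        fields = [f.strip() for f in m.group("part").split(",")]
        dcaps = int(m.group("dcaps"), 16)
        results.append(
            (fields[0], fields[-1], m.group("active") == "1", dcaps == EXPECTED_DCAPS)
        )
    return results

SAMPLE = """<#0> Partition:<2*8286-42A*21FFFFF, , 172.22.94.103>
       Active:<1>, OS:<AIX, 6.1, 6100-09-04-1441>, DCaps:<0x14f9f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<1247>
<#1> Partition:<1*8286-42A*21FFFFF, , 172.22.94.102>
       Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>
"""

for row in dlpar_status(SAMPLE):
    print(row)
```

On a real system, the text would come from running lspartition -dlpar on the HMC over SSH and saving the output.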