Josh 2014

oslevel wrong

I always forget instfix and oslevel -rl....
tags: aix oslevel incorrect backlevel wrong upgrade update

When these things show nothing:
lppchk -v
oslevel -sl `oslevel -sq 2>/dev/null | head -1`

and your bos.rte.install and bos.mp64 show the correct level compared to:
https://www-304.ibm.com/support/docview.wss?uid=isg1fileset2063572681

You should see the correct level here as well:
oslevel -sq | head

Check these other two things.
oslevel -r -l `oslevel -rq 2>/dev/null | sed -n '1p'`
and
instfix -icqk 6100-09-06-1543 | grep ":-:"
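
A small filter can make the instfix check above easier to read. This is a minimal sketch, assuming the colon-separated `instfix -icq` layout (keyword:fileset:required level:installed level:status:abstract), where a status of `-` means down-level and `!` means not installed. The name `downlevel_filesets` is made up for illustration.

```shell
# Hypothetical helper: reads `instfix -icqk <level>` output on stdin and
# prints one line per fileset that is down-level (-) or not installed (!).
# Assumed fields: keyword:fileset:required:installed:status:abstract
downlevel_filesets() {
    awk -F: '$5 == "-" || $5 == "!" {
        printf "%s needs %s, has %s\n", $2, $3, $4
    }'
}

# Usage on AIX:
#   instfix -icqk 6100-09-06-1543 | downlevel_filesets
```

No output means nothing is behind, which matches the "shows nothing" case described above.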

broken bos.rte.security

In a couple of instances, I've found bos.rte.* filesets broken during upgrade, perhaps with the root part missing.

It's always a pain, and I always forget how to fix it.

The problem is that the AIX base media does not include base install images for these. They are S (single) updates instead of I (install) images. This is because, during install, a bff called "bos" is laid down first, and that includes 10-20 core filesets, /usr, /, and all the core stuff. It's basically a prototype mksysb. Sort of.

Anyway, in rare instances, when there is a known defect, IBM will release a fileset as a patch through support/ztrans to get you fixed. If you don't have time to wait, or if you are a biz partner, working with a customer who hasn't yet approved you using their support, then you might have to fix it yourself.

The solution was ODM surgery.

First, I took a mksysb and copied it to somewhere safe (another server with NIM installed).

Then, I looked into ODM, and found /etc/objrepos/product was missing the entry for this version.
You might be able to copy from /usr/lib/objrepos, but I copied from a valid clone of this system.

# export ODMDIR=/etc/objrepos
# ssh goodserver odmget -q lpp_name=bos.rte.security product | odmadd


Then, I needed to add the history line, which was identical between root and usr. The history class is keyed by lpp_id, not lpp_name (39 was the lpp_id on my system):
# odmget -q name=bos.rte.security lpp     (note the lpp_id)
# ODMDIR=/usr/lib/objrepos odmget -q lpp_id=39 history | ODMDIR=/etc/objrepos odmadd


The "inventory" ODM class is also keyed by lpp_id, but that had a long list of files already. I did not mess with any of that.

Now, install_all_updates from my TLSP worked fine.
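
For the "note the lpp_id" step, the number can be pulled out of the stanza instead of read by eye. This is only a sketch, assuming the usual `field = value` stanza layout that odmget prints; `lpp_id_of` is a hypothetical helper name.

```shell
# Hypothetical helper: print the lpp_id from an `odmget ... lpp` stanza
# read on stdin. Assumes lines of the form:  lpp_id = 39
lpp_id_of() {
    awk '$1 == "lpp_id" && $2 == "=" { print $3; exit }'
}

# Usage on AIX (root part of the fileset):
#   ODMDIR=/etc/objrepos odmget -q name=bos.rte.security lpp | lpp_id_of
```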

MIMIX for AIX - Misc Troubleshooting

Because there is NOTHING on the web about this.
PRODUCT: Recover Now / Double Take / MIMIX / EchoStream

Vision Solutions bought Double-Take. Double-Take wrote Recover Now, which is called MIMIX on AS-400. The replication tools underneath are called "EchoStream".

NOTE: Documentation is hard to find, but here is a shortened URL form of the Windows docs: http://omnitech.net/u/rn35docs

Most functions can be managed from the web GUI:
http://127.0.0.1:8410/ui/portal
Obviously, put your correct IP here if you are not on the same host.

Install Licenses


Stop the license manager on PRIMARY:
stopsrc -cs scrt_lca-1

Stop the license manager on BACKUP:
stopsrc -cs scrt_aba-1

Copy the new license files:
scp -rp NIMSERVER:/export/Vision/license.perm/*_`hostname`_ES_node_license.properties /usr/scrt/run/node_license.properties

Start the license manager on PRIMARY:
startsrc -s scrt_lca-1

Start the license manager on BACKUP:
startsrc -s scrt_aba-1

Define initial contexts


/usr/scrt/bin/rtdr -C PRIMARYID (usually 1) -F BACKUPID (usually 101) setup

Query RN Contexts


Context 1 is Primary. DR shows as BACKUP to this, and prod shows PRODUCTION for this.
Context 101 is Recovery. DR shows PRIMARY for this, and prod shows BACKUP for this.

root@BACKUPNODE
/usr/scrt/bin/sccfgd_getctxs
HOSTID HEXNUMBER
IPADDRESS MULTIPLELINES
BACKUP 1
PRODUCTION 101

root@PRODUCTIONNODE
# /usr/scrt/bin/sccfgd_getctxs
HOSTID DIFFERENTHEXNUMBER
IPADDRESS MULTIPLELINES
PRODUCTION 1
BACKUP 101
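
To check a node's role for one context in a script, the listing can be filtered. A minimal sketch, assuming the output really is "ROLE ID" lines as in the placeholder listings above; `ctx_role` is a made-up name.

```shell
# Hypothetical helper: given sccfgd_getctxs output on stdin, print the role
# (PRODUCTION or BACKUP) this node holds for the context ID passed as $1.
ctx_role() {
    awk -v ctx="$1" '($1 == "PRODUCTION" || $1 == "BACKUP") && $2 == ctx { print $1 }'
}

# Usage:
#   /usr/scrt/bin/sccfgd_getctxs | ctx_role 101
```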


Uninstall EchoStream


/usr/scrt/bin/scsetup -R -C1
/usr/scrt/bin/sclist -DD -C1
odmdelete -o SCCuAt
odmdelete -o SCCuObj
odmdelete -o SCCuRel



NORMAL OPERATIONS


### EchoStream start
/usr/scrt/bin/rtstart -C1

### RN Check to see if kernel module is loaded
/usr/scrt/bin/scconfig -sC1

### RN Check if services are online
lssrc -a | grep scrt
scrt_lca-1 is the sender
scrt_aba-101 is the receiver
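
The service check can be reduced to "what is not running". This is a sketch only, assuming the standard `lssrc -a` columns (Subsystem, Group, PID, Status) with the status in the last column; `scrt_not_active` is a hypothetical name.

```shell
# Hypothetical helper: reads `lssrc -a` output on stdin and prints every
# scrt_* subsystem whose last column (Status) is not "active".
scrt_not_active() {
    awk '$1 ~ /^scrt_/ && $NF != "active" { print $1 }'
}

# Usage:
#   lssrc -a | scrt_not_active
```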


### Protected filesystem mount
NOTE: This is usually handled by rtstart.
/usr/scrt/bin/rtmnt -C1

### Protected filesystem umount
NOTE: This is usually handled by rtstop.
/usr/scrt/bin/rtumnt -C1

### EchoStream sync, stop, and unload service
/usr/scrt/bin/rtstop -SC1

### EchoStream stop & unload service
/usr/scrt/bin/rtstop -C1

### EchoStream stop & unload kernel extension
/usr/scrt/bin/rtstop -FC1

### Check dirty blocks in state map
This will show how many blocks need to be sync'd for the recovery group:
/usr/scrt/bin/scconfig -PC1

### RN List buffer utilization
NOTE: When the local buffer overflows, it just reverts to state-map tracking without point-in-time recovery.
/usr/scrt/bin/esmon 1

### Shutdown all contexts
NOTE: This can be added to /etc/rc.shutdown, or in cluster start/stop scripts.
/usr/scrt/bin/rn_shutdown


FAILOVER PROCEDURES


Much is missing here. This is what I could find on the internet.
You can also do this from the WebUI.

### Fail back to Primary Server
/usr/scrt/bin/rtdr -qC 1 failback

### Failover to Recovery Server
/usr/scrt/bin/rtdr -qC 101 failback

### Make clone of filesystem
/usr/scrt/bin/scrt_ra -C1 -X

### Release clone of filesystem
/usr/scrt/bin/scrt_ra -C1 -W -L /dev/dbfs01lv

MANUAL OPERATIONS


### RN Primary Manual start
In troubleshooting and testing, these commands can start Recover Now manually:
/opt/visionsolutions/http/vsisvr/httpsvr/bin/strvsisvr
varyonvg rnvspvg
/usr/bin/startsrc -s scrt_scconfigd
/usr/scrt/bin/rtstart -C1


### Start without mount and fsck
/usr/scrt/bin/rtstart -C1 -M

### RN Primary Manual stop
In troubleshooting and testing, these commands will stop Recover Now manually:

# Unmount the protected filesystems
/usr/scrt/bin/rtumnt -DC1 | tee -a $log

# Kill processes if the filesystem is still mounted.
for i in `/usr/scrt/bin/sclist -C1 -f` ; do
    # Only kill if the filesystem still shows up in the mount table.
    if mount | grep "$i" ; then
        fuser -kxuc "$i"
    fi
done


# Try rtumnt again due to some timing issues observed.
sleep 3
/usr/scrt/bin/rtumnt -DC1


# Sync outstanding lfc's to DR server
/usr/sbin/sync
/usr/scrt/bin/scconfig -SC1


# Stop RecoverNow
/usr/scrt/bin/rtstop -FkC1

Recover Now Reset State Map


This will cause the entire recovery group to be resync'd as if new, clearing any rollback points.

First, manually stop all resources, as listed above, then bring the context online:
varyonvg rnvspvg
/usr/scrt/bin/scconfig -MC1


### RTDR Resync
# Remote resync of prod, run from DR
/usr/scrt/bin/sccfgd_cmd -H PRODNODE -T "1 resync"

# Local on DR
/usr/scrt/bin/rtdr -qC101 resync

### Mount the filesystems on Primary
/usr/scrt/bin/rtmnt -C1

### Mount the filesystems on Recovery
/usr/scrt/bin/rtmnt -C101

### Unmount filesystems
/usr/scrt/bin/rtumnt -C1 # or -C 101

Recover Now Release Stuck Config


For errors such as:
scsmutil: log anchor cksum mismatch
ERROR: Failed to load EchoStream Production Server Drivers
ERROR: Drivers not loaded... Will not mount into an unprotected state

Clear the error:
/usr/scrt/bin/scsetup -MC1
/usr/scrt/bin/scconfig -uC1


Then you can use rtstart as normal.



FIX HOSTID CHANGED


### Start Recover Now
/opt/visionsolutions/http/vsisvr/httpsvr/bin/strvsisvr
varyonvg rnvspvg
/usr/bin/startsrc -s scrt_scconfigd
/usr/scrt/bin/rtstart -C1
Context not properly defined on this system

# /usr/scrt/bin/sccfgd_getctxs
HOSTID (new hostid)
IPADDRESS (multiple lines)
No context for production or backup listed

# /usr/scrt/bin/rtdr -C 1 -F 101 setup
/usr/scrt/bin/rtdr[14]: test: argument expected
rtdr: Configuration error -
rtdr: Primary Context ID <1> is not enabled.
rtdr: The Primary Context ID <1> must be enabled
rtdr: when creating a Failover Context ID.


### Shutdown the context
# /usr/scrt/bin/scsetup -MC1
scsetup: AET_TMO_NOVOTE: Setup failed.
scsetup: Detail: On wrong host.

# /usr/scrt/bin/scconfig -uC1
scconfig: AET_TMO_NOVOTE: Unexpected error
scconfig: Detail: On wrong host.

# cat /usr/scrt/run/node_license.properties
## begin signed data
#DoW Mon DD HH:MM:SS CDT YYYY
vision.license.customer=Company_name_with_underscores
vision.license.productname=EchoStream
vision.license.expirydatemig=YYYY-MM-DD HH\:MM\:SS
vision.license.machineid=0123456789abcdefghijLMNOPQR\=
vision.license.hostname=hostname


### Vision support is via:
RecoverNow/GeoCluster AIX, Replicate1 24x7 CustomerCare Technical Support:
U.S. and Canada: (800) 337-8214
International: +1 (949) 724-5465
CustomerCare Support Email: support@visionsolutions.com

After hours will just page out, but not make a ticket.
Email will have a ticket created within a few minutes.

### Test startup
# /usr/scrt/bin/scsetup
scsetup: AET_TMO_NOVOTE: Setup failed.
scsetup: Detail: On wrong host.


### Set path properly
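NOTE: /etc/environment only takes plain NAME=value lines (no export, no $PATH expansion), so this goes in /etc/profile.
cat <<'EOF' >> /etc/profile
export PATH=/usr/scrt/bin:$PATH
EOF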


### Collect reference info from "production" node and "backup" node.
/usr/scrt/bin/scconfig -v
/usr/scrt/bin/scconfig -q
/usr/scrt/bin/rtattr -C1 -a HostId
/usr/scrt/bin/rtattr -C101 -a HostId
/usr/scrt/bin/rthostid


### Update hostid for changed production node
HOSTID=`rthostid`
/usr/scrt/bin/rtattr -C1 -a HostId -o production -v $HOSTID
/usr/scrt/bin/rtattr -C101 -a HostId -o backup -v $HOSTID
ssh BACKUPNODE /usr/scrt/bin/rtattr -C1 -a HostId -o production -v $HOSTID


### Re-collect all of the same reference data as above.

### Reconfigure the repository
scconfig -sC1
ssh BACKUPNODE /usr/scrt/bin/rtdr -C1 -F101 setup
/usr/scrt/bin/rtdr -C1 -F101 setup


### Restart everything
/usr/scrt/bin/rtstart -C1 && startsrc -s scrt_scconfigd
until df -k /databasedir 2>/dev/null >/dev/null ; do date ; sleep 10 ; done
/opt/visionsolutions/http/vsisvr/httpsvr/bin/strvsisvr 2>/dev/null
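
The `until df` wait above loops forever if the filesystem never shows up. A bounded variant could look like the following; this is a sketch, and `wait_until` is a hypothetical helper, not part of the product.

```shell
# Hypothetical helper: retry a command every INTERVAL seconds until it
# succeeds or roughly TIMEOUT seconds have passed.
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Example:
#   wait_until 600 10 df -k /databasedir || echo "still not mounted"
```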
Josh 2004 Happy

Recovery From A Deleted /dev Directory in AIX

This is from AIXMIND, on March 26, 2010 2:46 pm
It doesn't show up high enough in search queries, so I'm duplicating it here.
Note that AIX recreates most of /dev on boot, but we need a certain amount.

===============================================
Problem(Abstract): The /dev directory was accidentally deleted.
===============================================
Symptom: System won't boot
===============================================
Environment: AIX 5.3 (and others)
==============================================
Resolving the problem

Boot system into maintenance mode.
Access a root volume group before mounting filesystems

mount /dev/hd4 /mnt
mount /dev/hd2 /mnt/usr
mknod /mnt/dev/hd1 b 10 8
mknod /mnt/dev/hd2 b 10 5
mknod /mnt/dev/hd3 b 10 7
mknod /mnt/dev/hd4 b 10 4
mknod /mnt/dev/hd5 b 10 1
mknod /mnt/dev/hd6 b 10 2
mknod /mnt/dev/hd8 b 10 3
mknod /mnt/dev/hd9var b 10 6
umount /mnt/usr
umount /mnt
shutdown -Fr

source: http://www.aixmind.com/?p=728
Ref: http://wp.me/p3ecOp-zh
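
The major/minor numbers in the mknod list may not match every system, so it can help to record the real ones while the box is healthy. A minimal sketch, assuming the usual `ls -l` long listing for device files (mode, links, owner, group, "MAJOR," MINOR, date, name); `devs_to_mknod` is a made-up helper.

```shell
# Hypothetical helper: turn `ls -l /dev/hd*` output (stdin) into matching
# mknod commands, so the real major/minor numbers are on record.
devs_to_mknod() {
    awk '$1 ~ /^[bc]/ {
        type = substr($1, 1, 1)            # b = block, c = character
        major = $5; sub(/,/, "", major)    # strip the trailing comma
        name = $NF; sub(/.*\//, "", name)  # strip any leading path
        printf "mknod /mnt/dev/%s %s %s %s\n", name, type, major, $6
    }'
}

# Usage, before anything goes wrong:
#   ls -l /dev/hd* | devs_to_mknod > rootvg_mknod.txt
```

Keep the resulting file somewhere other than the rootvg it describes.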

TDP for VMWare installer and IBM support rant

yAy. More TSM issues. There's a defect in TSM client 7.1, but IBM says it only exists in 6.3 and 6.4. There's a patch level about 2 weeks old (7.1.0.2), but there is no list of what's fixed in this patch.

I installed this, plus 7.1.0.1 of the VMWare agent, and the node that runs the GUI failed to update. The installer won't uninstall, reinstall, or repair. Windows uninstall just runs their installer.

IBM says that I should not call them, but I should use their simple website to submit a problem report. To report the defect, I have to use my customer's ID. I can't do this as a business partner unless I have my own support contract, beyond the money we pay to have access to support and software on a yearly basis.

To have access to a customer's ID, I have to wait for approval, of course. Beyond that, it shows up in a list that simply says "United States". So if I have, say 10 customers, I have no idea which number is for which customer.

Also, when selecting the product I want, there is no tree. It's a JAVA APPLET which has a list of products. I can search, but the naming is not consistent. Some say "Tivoli Storage Manager" and some say "TSM". Even for different versions of the same product, this naming difference occurs.

When I find it, it says that there will be a delay if I choose this product. Am I sure I want to choose this product? WTF?

There are no places to report any of these errors through the support organization, and no links on the pages to report them either. I have to report them to a general form 8 links away that may or may not be able to help.

Ginni Rometty is so focused on cutting cost to boost stock prices so her stock options at company exit have value, that she's downright gutting the infrastructure required for things like quality assurance and customer usability. Yes, everything is being updated for usability, but if it has worse functionality, or breaks entirely, then it's not REALLY a usability update.

Anyway, after supper, I'm going to call on the phone and listen to all of the messages telling me how easy it is to open a support request online, and that I should hang up and visit the web instead of wasting their dollars to fill out a new PMR that takes them months of training to be able to almost figure out. Then, I'll wait for an email because they don't ever call back anymore, and haven't been live-call-in for years.

The email will ask me to uninstall the software and try again. I won't be able to preemptively tell them anything in advance because I won't have online access to the PMR because I'm waiting for approval and then I have to remember which customer number to look under.

Reference: Hung AIX print jobs

Description: All print queues are hung.

Solution: Kill off all hung print IO processes
This should resolve anything other than a printer offline.
If you cannot ping the printer, then that has to be fixed first.

stopsrc -g spooler
sleep 60
ps auxww | grep pio

kill everything listed
Use kill -9 if needed.

cd /var/spool/lpd/qdir
ls -alt
remove any bad or old jobs listed that are not needed (usually anything over a few hours or days)

cd /var/spool/lpd/stat
ls -alt
remove any stale status files listed (or all of them; they will regenerate).

ls /etc/q*
verify qconfig.bin is the same age as, or newer than, qconfig. If not, remove qconfig.bin

Restart the print subsystem
startsrc -g spooler

All should be well.
Try using enq or lpr to print a job.
lpstat -p printername to list jobs.
lpstat with no flags will list all print queues, but will hang on unpingable printers.
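
The qconfig.bin date check above can be scripted rather than eyeballed from `ls`. A minimal sketch, assuming the shell's test builtin supports `-nt` (ksh and bash do); `is_stale` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when the source file ($1) is newer than the
# compiled file ($2), i.e. when the compiled copy is stale.
is_stale() {
    [ "$1" -nt "$2" ]
}

# Usage on AIX, with the spooler stopped:
#   is_stale /etc/qconfig /etc/qconfig.bin && rm /etc/qconfig.bin
```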

IBM FTP sites

More and more are moving to access through fixcentral, but for now, HMC recovery media is at:

ftp://public.dhe.ibm.com/software/server/hmc/

That looks to be the same as service.software and ftp.boulder.

Operations Center Default URL

This is to prime search engines with something I couldn't find last time I looked.

The new TSM 7.1 operation center, also TSM 6.3.4.200, also called TSM Operations Center, or internally even TSM Control Center...

The default URL is https://xxx.xxx.xxx.xxx:11090/oc/

The port wasn't in the docs, and the path wasn't in some of the docs. None of it came up Googling.

So frustrating. I had a screen scrape of my install, but somehow missed this, or maybe it's only under advanced?

I couldn't find it in other docs, but Will was able to help me out.

TSM.PWD permissions

In the past, I set up TSM.PWD as root, but this seems to not be what I needed.

I'm posting because the error messages and IBM docs don't cover this.

tsmdbmgr.log shows:
ANS2119I An invalid replication server address return code rc value = 2 was received from the server.

TSM Activity log shows:
ANR2983E Database backup terminated due to environment or setup issue related to DSMI_DIR - DB2 sqlcode -2033 sqlerrmc 168. (SESSION: 1, PROCESS: 9)

db2diag.log shows:

2014-02-26-13.54.12.425089-360 E415619A371 LEVEL: Error
PID : 15138852 TID : 1 PROC : db2vend
INSTANCE: tsminst1 NODE : 000
HOSTNAME: tsmserver
EDUID : 1
FUNCTION: DB2 UDB, database utilities, sqluvint, probe:321
DATA #1 : TSM RC, PD_DB2_TYPE_TSM_RC, 4 bytes
TSM RC=0x000000A8=168 -- see TSM API Reference for meaning.

EDUID : 38753 EDUNAME: db2med.35926.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:656
DATA #1 : String, 134 bytes
Vendor error: rc = 11 returned from function sqluvint.
Return_code structure from vendor library /tsm/tsminst1/sqllib/adsm/libtsm.a:

DATA #2 : Hexdump, 48 bytes
0x0A00030462F0C4D0 : 0000 00A8 3332 3120 3136 3800 0000 0000 ....321 168.....
0x0A00030462F0C4E0 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0A00030462F0C4F0 : 0000 0000 0000 0000 0000 0000 0000 0000 ................

EDUID : 38753 EDUNAME: db2med.35926.0 (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluMapVend2MediaRCWithLog, probe:696
MESSAGE : Error in vendor support code at line: 321 rc: 168

RC 168 per dsmrc.h means:
#define DSM_RC_NO_PASS_FILE 168 /* password file needed and user is
not root */

Verified everything required for this:
• passworddir points to the right directory
• DSMI_DIR points to the right directory
• dsmtca runs okay
• dsmapipw runs okay

Verified hostname info was correct

dsmffdc.log shows:
[ FFDC_GENERAL_SERVER_ERROR ]: (rdbdb.c:4200) GetOtherLogsUsageInfo failed, rc=2813, archLogDir = /tsm/arch.

Checked, and the log directory inside dsmserv.opt was typoed as /tsm/arch instead of /tsm/arc as was used to create the instance and as exists on the filesystems.

Updated dsmserv.opt and restarted tsm server. No change other than fixing Q LOG

SOLUTION:
The TSM.PWD file must be owned by the instance user, not by root.
Make sure to run the dsmapipw as the instance user, or chown the file after.

Simple, and fairly obvious, but maybe not always so obvious.
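
A quick ownership check would have saved me the log digging. This is a sketch only: `file_uid` is a hypothetical helper, `tsminst1` is the instance user from the db2diag.log above, and `PASSWORDDIR` stands in for whatever passworddir is set to.

```shell
# Hypothetical helper: print the numeric owner (uid) of a file.
file_uid() {
    ls -ldn "$1" | awk '{ print $3 }'
}

# Usage:
#   [ "$(file_uid "$PASSWORDDIR/TSM.PWD")" = "$(id -u tsminst1)" ] \
#       || echo "TSM.PWD is not owned by the instance user"
```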