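Sun Cluster System Fault Report
XX City Television Station
Sun and Oracle Database System
Fault Report
Report No.: 20071115
November 2007
Sichuan XX Information Industry Co., Ltd.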
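Preface
At about 4:00 p.m. on November 15, 2007, the system stopped working normally.
Sichuan XX Information Industry Co., Ltd., together with technical staff from the XX Television Station Information Center, collected detailed information on the Sun servers, the two disk arrays, the two databases, the Sun Cluster system, and RAC, then analyzed and troubleshot the fault.
The findings are summarized below: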
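Chapter 1 Platform Composition
1.1 Hardware Platform
The hardware platform consists of two Sun Fire V490 servers and two Sun StorEdge 3310 disk arrays.
The topology is shown below: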
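Server configuration
Each Sun Fire V490 server is configured as follows:
Number of CPUs: 2
CPU type: Sun UltraSPARC IV, dual-core
CPU speed: 1050 MHz, with 16 MB of L2 cache per CPU
Memory: 8 GB
Internal drives: two 72 GB FC-AL hard disks and one DVD drive
Network: four Gigabit Ethernet ports
SCSI: one dual-channel Ultra320 SCSI card
Power: two power supplies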
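Disk array configuration:
Two Sun StorEdge 3310 SCSI arrays, each configured as follows:
Controller: a single array controller with 512 MB of cache
Disks: five 72 GB 10K rpm SCSI disks on channel 0, IDs 8 to 12; disks 8 to 11 form a RAID 5 volume and disk 12 is the hot spare
Power: two power supplies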
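As a quick check on what this layout provides, the nominal usable capacity of the RAID 5 volume can be worked out from the figures above. The sketch below is illustrative only: the function name is hypothetical, and it uses the nominal 72 GB disk size, so the capacity the array actually reports will be somewhat lower.

```python
def raid5_usable_gb(member_disks: int, disk_size_gb: int) -> int:
    """Nominal usable capacity of a RAID 5 set.

    One disk's worth of space is consumed by parity, so the usable
    capacity is (members - 1) * disk size.
    """
    if member_disks < 3:
        raise ValueError("RAID 5 needs at least three member disks")
    return (member_disks - 1) * disk_size_gb


# Disks with IDs 8 to 11 are the RAID 5 members; ID 12 is the hot spare
# and contributes no capacity until it replaces a failed member.
member_ids = [8, 9, 10, 11]
usable = raid5_usable_gb(len(member_ids), 72)
print(usable)  # 216
```

Each array therefore offers roughly 216 GB of nominal usable space, and it can survive the loss of one member disk while the hot spare rebuilds.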
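1.2 Operating System Platform
The system runs the 64-bit Sun Solaris 9 operating system (SunOS 5.9), kernel version Generic_118558-17.
1.3 Database Platform
The system runs Oracle 9.2.0.0 Enterprise Edition in a RAC configuration.
1.4 Cluster Software
Sun Cluster 3.1 two-node cluster software.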
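Chapter 2 Fault Symptoms and Analysis
2.1 Symptom Description
At about 4:00 p.m. on November 15, users reported that the database could not be accessed.
It was confirmed that the Sun Fire V490 server cdtvdb1 (IP 172.17.10.40) could not be logged in to.
The Sun Fire V490 server cdtvdb2 (IP 172.17.10.43) could be logged in to, but the database could not be connected.
cdtvdb2 was therefore rebooted. The database processes started normally, but users still reported that they could not connect to the database.
After the database listener was started manually on cdtvdb2, user connections returned to normal.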
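The last two symptoms (database processes running, yet clients unable to connect until the listener was started) can be told apart with a plain TCP probe of the listener port. The following is a minimal sketch, not part of the original report: the function name is hypothetical, and port 1521 is only the Oracle default, since the report does not state which port the listener uses.

```python
import socket


def listener_reachable(host: str, port: int = 1521, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Port 1521 is the Oracle listener default and is an assumption here;
    substitute the port configured in listener.ora if it differs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

For example, `listener_reachable("172.17.10.43")` returning False while the instance processes are up would point to the listener rather than the database itself.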
2.2 System Error Log
The relevant error log entries on host cdtvdb1 are as follows:
Nov 15 13:08:26 cdtvdb1 cl_comm: [ID 705198 kern.notice] NOTICE: extract_pkt_info: drop packet - IPV6 unsupported
Nov 15 17:09:39 cdtvdb1 genunix: [ID 408822 kern.info] NOTICE: mpt0: fault detected in device; service still available
Nov 15 17:09:39 cdtvdb1 genunix: [ID 611667 kern.info] NOTICE: mpt0: Connected command timeout for Target 1
Nov 15 17:09:39 cdtvdb1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:09:39 cdtvdb1 Target 1 reducing sync. transfer rate
Nov 15 17:09:39 cdtvdb1 mpt: [ID 795936 kern.warning] WARNING: ID[SUNWpd.mpt.sync_wide_backoff.6014]
Nov 15 17:09:43 cdtvdb1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/scsi@1/sd@1,0 (sd2):
Nov 15 17:09:43 cdtvdb1 Error for Command: write(10) Error Level: Retryable
Nov 15 17:09:43 cdtvdb1 scsi: [ID 107833 kern.notice] Requested Block: 23371173 Error Block: 23371173
Nov 15 17:09:43 cdtvdb1 scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 238EDB0B-01
Nov 15 17:09:43 cdtvdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 15 17:09:43 cdtvdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Nov 15 17:10:38 cdtvdb1 scsi: [ID 107833 kern.notice] /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:10:38 cdtvdb1 got external SCSI bus reset.
Nov 15 17:12:51 cdtvdb1 genunix: [ID 408822 kern.info] NOTICE: mpt0: fault detected in device; service still available
Nov 15 17:12:51 cdtvdb1 genunix: [ID 611667 kern.info] NOTICE: mpt0: Disconnected command timeout for Target 1
Nov 15 17:13:51 cdtvdb1 genunix: [ID 408822 kern.info] NOTICE: mpt0: fault detected in device; service still available
Nov 15 17:13:51 cdtvdb1 genunix: [ID 611667 kern.info] NOTICE: mpt0: lost interrupt during polling - resetting controller
Nov 15 17:13:51 cdtvdb1 scsi: [ID 365881 kern.info] /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:13:51 cdtvdb1 Rev. 8 LSI, Inc. 1030 found.
Nov 15 17:13:51 cdtvdb1 scsi: [ID 365881 kern.info] /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:13:51 cdtvdb1 mpt0 supports power management.
Nov 15 17:13:51 cdtvdb1 scsi: [ID 365881 kern.info] /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:13:51 cdtvdb1 mpt0: IOC Operational.
Nov 15 17:13:57 cdtvdb1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/scsi@1/sd@1,0 (sd2):
Nov 15 17:13:57 cdtvdb1 Error for Command: read Error Level: Retryable
Nov 15 17:13:57 cdtvdb1 scsi: [ID 107833 kern.notice] Requested Block: 565008 Error Block: 565008
Nov 15 17:13:57 cdtvdb1 scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 238EDB0B-01
Nov 15 17:13:57 cdtvdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 15 17:13:57 cdtvdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Nov 15 17:13:57 cdtvdb1 scsi: [ID 107833 kern.notice] /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:13:57 cdtvdb1 got external SCSI bus reset.
Nov 15 17:13:58 cdtvdb1 scsi: [ID 365881 kern.info] /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:13:58 cdtvdb1 Log info 11030000 received for target 1.
Nov 15 17:13:58 cdtvdb1 scsi_status=0, ioc_status=804b, scsi_state=8
Nov 15 17:13:58 cdtvdb1 scsi: [ID 107833 kern.notice] /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:13:58 cdtvdb1 got external SCSI bus reset.
Nov 15 17:14:01 cdtvdb1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/scsi@1/sd@1,0 (sd2):
Nov 15 17:14:01 cdtvdb1 Error for Command: read Error Level: Retryable
Nov 15 17:14:01 cdtvdb1 scsi: [ID 107833 kern.notice] Requested Block: 565008 Error Block: 565008
Nov 15 17:14:01 cdtvdb1 scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 238EDB0B-01
Nov 15 17:14:01 cdtvdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 15 17:14:01 cdtvdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Nov 15 17:14:16 cdtvdb1 genunix: [ID 408822 kern.info] NOTICE: mpt0: fault detected in device; service still available
Nov 15 17:14:16 cdtvdb1 genunix: [ID 611667 kern.info] NOTICE: mpt0: Connected command timeout for Target 1
Nov 15 17:14:16 cdtvdb1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/scsi@1 (mpt0):
Nov 15 17:14:16 cdtvdb1 Target 1 reverting to async. mode
Nov 15 17:14:16 cdtvdb1 mpt: [ID 675377 kern.warning] WARNING: ID[SUNWpd.mpt.sync_wide_backoff.6013]
Nov 15 17:14:19 cdtvdb1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/scsi@1/sd@1,0 (sd2):
Nov 15 17:14:19 cdtvdb1 Error for Command: read Error Level: Retryable
Nov 15 17:14:19 cdtvdb1 scsi: [ID 107833 kern.notice] Requested Block: 565008 Error Block: 565008
Nov 15 17:14:19 cdtvdb1 scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 238EDB0B-01
Nov 15 17:14:19 cdtvdb1 scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
Nov 15 17:14:19 cdtvdb1 scsi: [ID 107833 kern.notice] ASC: 0x0 (), ASCQ: 0x29, FRU: 0x0
Nov 15 17:14:34 cdtvdb1 genunix: [ID 408822 kern.info] NOTICE: mpt0: fault detected in device; service still available
Nov 15 17:14:34 cdtvdb1 genunix: [ID 611667 kern.info] NOTICE: mpt0: Connected command timeout for Target 1
Nov 15 17:14:38 cdtvdb1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/scsi@1/sd@1,0 (sd2):
Nov 15 17:14:38 cdtvdb1 Error for Command: read Error Level: Retryable
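To see the fault pattern at a glance, the entries can be tallied by event type. The sketch below is illustrative and was not part of the original investigation: it assumes the entries are in the standard syslog layout (`Mon DD HH:MM:SS host message`), the function and label names are hypothetical, and the sample lines are a shortened subset of the log in this section.

```python
import re
from collections import Counter

# Standard syslog prefix: "Mon DD HH:MM:SS host rest-of-message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")

# Substrings taken from the messages in the cdtvdb1 log; labels are ours.
PATTERNS = {
    "command timeout": "command timeout for Target",
    "bus reset": "got external SCSI bus reset",
    "sync backoff": "reducing sync. transfer rate",
    "async fallback": "reverting to async. mode",
    "retryable I/O error": "Error Level: Retryable",
}


def tally(lines):
    """Count each event type and record the timestamp of its first occurrence."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a syslog entry
        stamp, _host, message = match.groups()
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[label] += 1
                first_seen.setdefault(label, stamp)
    return counts, first_seen


sample = [
    "Nov 15 17:09:39 cdtvdb1 genunix: [ID 611667 kern.info] NOTICE: mpt0: Connected command timeout for Target 1",
    "Nov 15 17:09:39 cdtvdb1 Target 1 reducing sync. transfer rate",
    "Nov 15 17:09:43 cdtvdb1 Error for Command: write(10) Error Level: Retryable",
    "Nov 15 17:10:38 cdtvdb1 got external SCSI bus reset.",
    "Nov 15 17:12:51 cdtvdb1 genunix: [ID 611667 kern.info] NOTICE: mpt0: Disconnected command timeout for Target 1",
    "Nov 15 17:14:16 cdtvdb1 Target 1 reverting to async. mode",
]
counts, first_seen = tally(sample)
for label, count in counts.items():
    print(f"{label}: {count} (first at {first_seen[label]})")
```

Run against the full log, a tally like this makes the sequence visible: command timeouts on Target 1, a reduction of the sync transfer rate, repeated bus resets, and finally a fallback to async mode.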
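2.3 Analysis of Symptoms
1. According to the error log, at Nov 15 17:09:39 cdtvdb1 reported the error "genunix: [ID 611667 kern.info] NOTICE: mpt0: Connected command timeout for Target 1", which made the database data files unusable. The Sun Cluster software then tried to restart cdtvdb1, but the node hung on error messages during boot, so the restart did not succeed.
2. cdtvdb2 was also unusable at the time. From the analysis that follows, this can be attributed to the data on the storage arrays being unavailable.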
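Chapter 3 Troubleshooting
Using comparison and elimination, and guided by the logged messages, the HBA card was ruled out as the cause.
The two 3310 disk arrays were then restarted, followed by the two V490 servers, after which the whole system returned to normal.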
Chapter 4 Fault Characterization and Location
The relevant bug documentation that was consulted is reproduced below:
Document Audience: SPECTRUM
Document ID: 80089
Title: SCSI bus reset & transport errors on Sun StorEdge[TM] 3310 SCSI array with Ultra320 SCSI HBA
Copyright Notice: Copyright © 2007 Sun Microsystems, Inc. All Rights Reserved
Update Date: Thu Jun 28 00:00:00 MDT 2007
Products: Sun StorageTek Ultra320 LVD PCI Host Bus Adapter, Sun StorageTek 3310 SCSI Array
Technical Areas: Driver, SCSI Host Bus Adapter (HBA), Firmware
--------------------------------------------------------------------------------
Keyword(s): ultra320, transport errors, SE3310, JBOD, SCSI bus reset, mpt.conf, minnow, Jasper320, 3310, Jbod
Problem Statement:
On systems connected to a Sun StorEdge[TM] 3310 SCSI array or JBOD via an Ultra320 SCSI HBA, you may see some of these messages:
got external SCSI bus reset
Target x reducing sync. transfer rate
Target x reverting to async. mode
These messages could further be followed by read/write errors such as:
Jan 16 04:04:21 larempdb scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@4/sd@0,5 (sd76):
Jan 16 04:04:21 larempdb Error for Command: write Error Level: Retryable
Jan 16 04:04:21 larempdb scsi: [ID 107833 kern.notice] Requested Block: 153264 Error Block: 153264
Jan 16 04:04:21 larempdb scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number: 1ACACDB1-05
Jan 16 04:04:21 larempdb scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Jan 16 04:04:21 larempdb scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
The above messages could be a result of:
Speed negotiation between the Ultra320 SCSI HBA and the Sun StorEdge 3310 SCSI array.
You can check the speed to each scsi target with:
# prtpicl -v | egrep "NAME=|sync-speed" | grep -v spindle
|NAME=ide-controller|
|NAME=mpt0|
:target0-sync-speed 320000
:target8-sync-speed 320000
:target9-sync-speed 320000
:targeta-sync-speed 320000
:targetb-sync-speed 320000
|NAME=mpt1|
|NAME=glm0|
:target0-sync-speed 160000
:target1-sync-speed 160000
|NAME=glm1|
:target0-sync-speed 160000
:target8-sync-speed 160000
:target9-sync-speed 160000
:targeta-sync-speed 160000
:targetb-sync-speed 160000
In this example the onboard controller of a Sun Fire[TM] V240 Server (glm1) is accessing the disks in one Sun StorEdge 3310 JBOD at 160MB/s, while the Ultra320 HBA (mpt0) is using a speed of 320MB/s to the other Sun StorEdge 3310 JBOD.
In such cases, the system may recover on its own, however performance will be degraded until then. It can also happen that one or more disks fails, becomes unavailable in format and causes messages to be logged, such as:
disk not responding to selection
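The check described above can be automated by scanning saved prtpicl output for targets that negotiated a speed above what the 3310 supports. This is a minimal sketch under stated assumptions: the function name is hypothetical, the input is the filtered output format shown in the example, and the sync-speed values are taken to be in KB/s because the Sun document equates 320000 with 320MB/s.

```python
import re


def over_limit_targets(prtpicl_text: str, limit_kb_s: int = 160000):
    """Return (controller, target, speed) for every target above limit_kb_s.

    The 160000 default reflects the 160MB/s ceiling the Sun document
    gives for the StorEdge 3310.
    """
    controller = None
    flagged = []
    for raw in prtpicl_text.splitlines():
        line = raw.strip()
        name = re.search(r"NAME=([\w-]+)", line)
        if name:
            controller = name.group(1)  # remember which controller we are under
            continue
        speed = re.search(r"(target[0-9a-f]+)-sync-speed\s*(\d+)", line)
        if speed and int(speed.group(2)) > limit_kb_s:
            flagged.append((controller, speed.group(1), int(speed.group(2))))
    return flagged


sample = """\
|NAME=mpt0|
:target0-sync-speed 320000
:target8-sync-speed 320000
|NAME=glm1|
:target0-sync-speed 160000
:target8-sync-speed 160000
"""
for controller, target, speed in over_limit_targets(sample):
    print(f"{controller} {target}: {speed} KB/s exceeds the 3310 limit")
```

On the sample, only the mpt0 targets are reported, which matches the document's observation that the Ultra320 HBA negotiates 320MB/s while the onboard glm controller stays at 160MB/s.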
Resolution:
The following course of action is recommended.
This problem can be fixed by throttling the speed of the Ultra320 HBA in the mpt driver configuration file to the maximum speed supported by the Sun StorEdge 3310. It is also strongly recommended to upgrade the Sun StorEdge 3310 controller and/or SAF-TE firmware to the latest version.
To limit the HBA speed to 160MB/s, create the file /kernel/drv/mpt.conf with the fo