== Recovering from a crash (Linux) ==

=== Diagnose whether you have a crash ===

The most important thing is to get the machine and all VEs back up as soon as possible. Note the time; you'll need it to create a crash log entry (Mgmt. -> Reference -> CrashLog). The first thing to do is head over to the [[Screen#Screen_Organization|serial console screen]] and see if there are any kernel error messages. Copy any messages (or just a sample of repeating messages) into the notes section of the crash log – these will also likely need to be sent to Virtuozzo for interpretation. If the messages are scrolling too fast to read, hit ^O + H to start a screen log dump, which you can review after the machine is rebooted. Additionally, if the machine is responsive, you can get a trace to send to Virtuozzo by hooking up a KVM and entering these 3 sequences:

<pre>alt+print screen+m
alt+print screen+p
alt+print screen+t</pre>

If there are no messages, the machine may just be really busy – wait a bit (5-10 min) to see if it comes back. If it's still pinging, odds are it's very busy. If it doesn't come back, or the messages indicate a fatal error, you will need to proceed with a power cycle (ctrl+alt+del will not work).

=== Power cycle the server ===

If this machine is not a Dell 2950 with a [[DRAC/RMM#DRAC|DRAC card]] (i.e. if you can't ssh into the DRAC card and issue <tt>racadm serveraction hardreset</tt>), then you will need someone at the data center to power the machine off, wait 30 sec, then turn it back on. Make sure to re-attach via console (<tt>tip virtxx</tt>) immediately after power down.

=== (Re)attach to the console ===

Stay on the console the entire time during boot. As the BIOS posts, look out for the RAID card output – does everything look healthy? The output may be scrambled; look for "DEGRADED" or "FAILED". Once the OS starts booting, you will be disconnected (dropped back to the shell on the console server) a couple of times during the boot-up.
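For machines that do have a DRAC card, the reset-and-reattach sequence can be sketched as a small shell function. This is only a sketch: it assumes the DRAC answers ssh as root and that <tt>tip</tt> maps virt names to console lines, and the <tt>drac-</tt> hostname prefix and <tt>virt12</tt> in the example are hypothetical – substitute the real DRAC address and host.

```shell
# Hypothetical sketch of the DRAC hard-reset path described above.
# The "drac-$host" naming convention is an assumption, not a documented one.
power_cycle_via_drac() {
    host=$1
    # Issue the hard reset on the DRAC card
    ssh "root@drac-$host" 'racadm serveraction hardreset'
    # Immediately re-attach to the serial console to watch the BIOS/RAID POST
    tip "$host"
}

# e.g.: power_cycle_via_drac virt12
```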
The reason you want to re-attach quickly is two-fold: 1. if you don't re-attach quickly, you won't get any console output; 2. you want to be attached before the server ''potentially'' starts (an extensive) fsck. If you attach after the fsck begins, you'll have seen no indication it started an fsck, and the server will appear frozen during startup – no output, no response.

=== Start containers/VEs/VPSs ===

When the machine begins to start VEs, it's safe to leave the console and log in via ssh. All virts should be set to auto-start all the VEs after a crash. Further, most (newer) virts are set to "fastboot" their VEs (to find out, run <tt>grep -i fast /etc/sysconfig/vz</tt> and look for <tt>VZFASTBOOT=yes</tt>). If this was set prior to the machine's crash (setting it after the machine boots will have no effect until the vz service is restarted), it will start each VE as fast as possible, in serial, then go through each VE (serially), shutting it down, running a vzquota (disk usage) check, and bringing it back up. The benefit is that all VEs are brought up quickly (within 15 min or so, depending on the number); the downside is that a customer watching closely will notice 2 outages – 1st the machine crash, 2nd their quota check (which will be a much shorter downtime, on the order of a few minutes). Where "fastboot" is not set to yes (e.g. on quar1), vz will start the VEs consecutively, checking the quotas one at a time, and the 60th VE may not start until an hour or two later – this is not acceptable. The good news is, if you run vzctl start for a VE that is already started, you will simply get an error: <tt>VE is already started</tt>. Likewise, if you attempt to vzctl start a VE that is in the process of being started, you will simply get an error: unable to lock VE. So there is no danger in simply running scripts to start smaller sets of VEs.
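Because duplicate or concurrent starts fail harmlessly with the lock errors above, a crude batch starter is safe. A minimal sketch, assuming the installed <tt>vzlist</tt> supports <tt>-H</tt> (no header), <tt>-S</tt> (stopped only), and <tt>-o veid</tt> – startvirt.pl is the proper tool; this just illustrates the idea:

```shell
# Start every stopped VE, six at a time, wrapped as a function.
# Parallelism via xargs -P mirrors startvirt.pl's "6 at once" behavior.
start_all_stopped() {
    vzlist -H -S -o veid | xargs -n1 -P6 vzctl start
}
```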
If the system is not auto-starting, there is no conflict to worry about; and even if it is, when your start collides with the autostart, one process (yours or the autostart) will simply lose the lock and move on to the next VE. A script has been written to assist with VE starts: [[#startvirt.pl|startvirt.pl]], which will start 6 VEs at once until there are none left. If startvirt.pl is used on a system where "fastboot" was on, it will circumvent the fastboot for the VEs it starts – they will go through the complete quota check before starting – so it is not advisable when a system has crashed. When a system is booted cleanly and there is no need for vzquota checks, startvirt.pl is safe and advisable to run.

=== Make sure all containers are running ===

You can quickly get a feel for how many VEs are started by running:

<pre>[root@virt4 log]# vs
VEID 16066 exist mounted running
VEID 16067 exist mounted running
VEID 4102 exist mounted running
VEID 4112 exist mounted running
VEID 4116 exist mounted running
VEID 4122 exist mounted running
VEID 4123 exist mounted running
VEID 4124 exist mounted running
VEID 4132 exist mounted running
VEID 4148 exist mounted running
VEID 4151 exist mounted running
VEID 4155 exist mounted running
VEID 42 exist mounted running
VEID 432 exist mounted running
VEID 434 exist mounted running
VEID 442 exist mounted running
VEID 450 exist mounted running
VEID 452 exist mounted running
VEID 453 exist mounted running
VEID 454 exist mounted running
VEID 462 exist mounted running
VEID 463 exist mounted running
VEID 464 exist mounted running
VEID 465 exist mounted running
VEID 477 exist mounted running
VEID 484 exist mounted running
VEID 486 exist mounted running
VEID 490 exist mounted running</pre>

So, to see how many VEs have started:

<pre>[root@virt11 root]# vs | grep running | wc -l
39</pre>

And to see how many haven't:

<pre>[root@virt11 root]# vs | grep down | wc -l
0</pre>

And how many we should have running:

<pre>[root@virt11 root]# vs | wc -l
39</pre>

Another tool you can use to see which VEs have started, among other things, is [[#vzstat|vzstat]]. It gives you CPU, memory, and other stats on each VE and on the overall system. It's a good thing to watch as VEs are starting (note the VENum parameter; it tells you how many have started):

<pre>4:37pm, up 3 days, 5:31, 1 user, load average: 1.57, 1.68, 1.79
VENum 40, procs 1705: running 2, sleeping 1694, unint 0, zombie 9, stopped 0
CPU [ OK ]: VEs 57%, VE0 0%, user 8%, sys 7%, idle 85%, lat(ms) 412/2
Mem [ OK ]: total 6057MB, free 9MB/54MB (low/high), lat(ms) 0/0
Swap [ OK ]: tot 6142MB, free 4953MB, in 0.000MB/s, out 0.000MB/s
Net [ OK ]: tot: in 0.043MB/s 402pkt/s, out 0.382MB/s 4116pkt/s
Disks [ OK ]: in 0.002MB/s, out 0.000MB/s
  VEID ST   %VM    %KM     PROC       CPU     SOCK  FCNT MLAT IP
     1 OK  1.0/17 0.0/0.4  0/32/256  0.0/0.5  39/1256   0    9 69.55.227.152
    21 OK  1.3/39 0.1/0.2  0/46/410  0.2/2.8  23/1860   0    6 69.55.239.60
   133 OK  3.1/39 0.1/0.3  1/34/410  6.3/2.8  98/1860   0    0 69.55.227.147
   263 OK  2.3/39 0.1/0.2  0/56/410  0.3/2.8  34/1860   0    1 69.55.237.74
   456 OK   17/39 0.1/0.2  0/100/410 0.1/2.8  48/1860   0   11 69.55.236.65
   476 OK  0.6/39 0.0/0.2  0/33/410  0.1/2.8  96/1860   0   10 69.55.227.151
   524 OK  1.8/39 0.1/0.2  0/33/410  0.0/2.8  28/1860   0    0 69.55.227.153
   594 OK  3.1/39 0.1/0.2  0/45/410  0.0/2.8  87/1860   0    1 69.55.239.40
   670 OK  7.7/39 0.2/0.3  0/98/410  0.0/2.8  64/1860   0  216 69.55.225.136
   691 OK  2.0/39 0.1/0.2  0/31/410  0.0/0.7  25/1860   0    1 69.55.234.96
   744 OK  0.1/17 0.0/0.5  0/10/410  0.0/0.7   7/1860   0    6 69.55.224.253
   755 OK  1.1/39 0.0/0.2  0/27/410  0.0/2.8  33/1860   0    0 192.168.1.4
   835 OK  1.1/39 0.0/0.2  0/19/410  0.0/2.8   5/1860   0    0 69.55.227.134
   856 OK  0.3/39 0.0/0.2  0/13/410  0.0/2.8  16/1860   0    0 69.55.227.137
   936 OK  3.2/52 0.2/0.4  0/75/410  0.2/0.7  69/1910   0    8 69.55.224.181
  1020 OK  3.9/39 0.1/0.2  0/60/410  0.1/0.7  55/1860   0    8 69.55.227.52
  1027 OK  0.3/39 0.0/0.2  0/14/410  0.0/2.8  17/1860   0    0 69.55.227.83
  1029 OK  1.9/39 0.1/0.2  0/48/410  0.2/2.8  25/1860   0    5 69.55.227.85
  1032 OK   12/39 0.1/0.4  0/80/410  0.0/2.8  41/1860   0    8 69.55.227.90</pre>

When you are all done, make sure that all the VEs really did get started: run vs one more time. Note the time all VEs are back up, enter it into the crash log entry, and save it. Occasionally, a VE will not start automatically. The most common reason a VE does not come up normally is that it was at its disk limit before the crash, and it will not start because it is over the limit. To overcome this, set the disk space to the current usage level (the system reports this when the VE fails to start), start the VE, then set the disk space back to the prior level. Lastly, contact the customer to let them know they're out of disk (or allocate more disk if they're entitled to more).
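The raise-start-restore dance for an over-quota VE can be sketched as a function. Sizes are soft:hard limits in 1K blocks, which is what <tt>vzctl --diskspace</tt> expects; the VEID and numbers in the example call are hypothetical – the failed start prints the VE's real current usage.

```shell
# Sketch: temporarily raise a VE's disk limit to its current usage so it
# can start, then put the prior limits back. Arguments are hypothetical
# examples, not values from any particular incident.
fix_overquota_ve() {
    veid=$1 current=$2 prior=$3
    vzctl set "$veid" --diskspace "$current:$current" --save  # raise limit to current usage
    vzctl start "$veid"                                       # VE can now start
    vzctl set "$veid" --diskspace "$prior" --save             # restore prior soft:hard limits
}

# e.g.: fix_overquota_ve 4102 10485760 8388608:9437184
```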