== Recovering from a crash (Linux) ==

=== Diagnose whether you have a crash ===

The most important thing is to get the machine and all VEs back up as soon as possible. Note the time; you'll need it to create a crash log entry (Mgmt. -> Reference -> CrashLog). The first thing to do is head over to the [[Screen#Screen_Organization|serial console screen]] and see if there are any kernel error messages. Copy any messages (or just a sample of repeating messages) into the notes section of the crash log – these will also likely need to be sent to Virtuozzo for interpretation. If the messages are scrolling too fast to read, hit ^O + H to start a screen log dump, which you can review after the machine is rebooted. Additionally, if the machine is responsive, you can get a trace to send to Virtuozzo by hooking up a KVM and entering these 3 sequences:

<pre>alt+print screen+m
alt+print screen+p
alt+print screen+t</pre>

If there are no messages, the machine may just be really busy – wait a bit (5-10 min) to see if it comes back. If it's still pinging, odds are it's very busy. If it doesn't come back, or the messages indicate a fatal error, you will need to proceed with a power cycle (ctrl+alt+del will not work).

=== Power cycle the server ===

If this machine is not a Dell 2950 with a [[DRAC/RMM#DRAC|DRAC card]] (i.e. if you can't ssh into the DRAC card and issue <tt>racadm serveraction hardreset</tt>), then you will need someone at the data center to power the machine off, wait 30 sec, then turn it back on. Make sure to re-attach via console (<tt>tip virtxx</tt>) immediately after power down.

=== (Re)attach to the console ===

Stay on the console the entire time during boot. As the BIOS posts, look out for the RAID card output – does everything look healthy? The output may be scrambled; look for "DEGRADED" or "FAILED". Once the OS starts booting, you will be disconnected (dropped back to the shell on the console server) a couple of times during the boot-up.
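For machines that do have a DRAC card, the reset-and-reattach sequence can be sketched as a small shell function. This is only a sketch: it assumes the DRAC answers ssh as root and that <tt>tip</tt> maps virt names to console lines, and the <tt>drac-</tt> hostname prefix and <tt>virt12</tt> in the example are hypothetical – substitute the real DRAC address and host.

```shell
# Hypothetical sketch of the DRAC hard-reset path described above.
# The "drac-$host" naming convention is an assumption, not a documented one.
power_cycle_via_drac() {
    host=$1
    # Issue the hard reset on the DRAC card
    ssh "root@drac-$host" 'racadm serveraction hardreset'
    # Immediately re-attach to the serial console to watch the BIOS/RAID POST
    tip "$host"
}

# e.g.: power_cycle_via_drac virt12
```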
The reason you want to re-attach quickly is two-fold: 1. if you don't re-attach quickly, you won't get any console output; 2. you want to be attached before the server ''potentially'' starts (an extensive) fsck. If you attach after the fsck begins, you'll have seen no indication it started an fsck, and the server will appear frozen during startup – no output, no response.

=== Start containers/VEs/VPSs ===

When the machine begins to start VEs, it's safe to leave the console and log in via ssh. All virts should be set to auto-start all the VEs after a crash. Further, most (newer) virts are set to "fastboot" their VEs (to find out, run <tt>grep -i fast /etc/sysconfig/vz</tt> and look for <tt>VZFASTBOOT=yes</tt>). If this was set prior to the machine's crash (setting it after the machine boots will have no effect until the vz service is restarted), it will start each VE as fast as possible, in serial, then go through each VE (serially), shutting it down, running a vzquota (disk usage) check, and bringing it back up. The benefit is that all VEs are brought up quickly (within 15 min or so, depending on the number); the downside is that a customer watching closely will notice 2 outages – 1st the machine crash, 2nd their quota check (which will be a much shorter downtime, on the order of a few minutes). Where "fastboot" is not set to yes (e.g. on quar1), vz will start the VEs consecutively, checking the quotas one at a time, and the 60th VE may not start until an hour or two later – this is not acceptable. The good news is, if you run vzctl start for a VE that is already started, you will simply get an error: <tt>VE is already started</tt>. Likewise, if you attempt to vzctl start a VE that is in the process of being started, you will simply get an error: unable to lock VE. So there is no danger in simply running scripts to start smaller sets of VEs.
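Because duplicate or concurrent starts fail harmlessly with the lock errors above, a crude batch starter is safe. A minimal sketch, assuming the installed <tt>vzlist</tt> supports <tt>-H</tt> (no header), <tt>-S</tt> (stopped only), and <tt>-o veid</tt> – startvirt.pl is the proper tool; this just illustrates the idea:

```shell
# Start every stopped VE, six at a time, wrapped as a function.
# Parallelism via xargs -P mirrors startvirt.pl's "6 at once" behavior.
start_all_stopped() {
    vzlist -H -S -o veid | xargs -n1 -P6 vzctl start
}
```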
If the system is not auto-starting, there is no conflict to worry about; and even if it is, when your start collides with the autostart, one process (yours or the autostart) will simply lose the lock and move on to the next VE. A script has been written to assist with VE starts: [[#startvirt.pl|startvirt.pl]], which will start 6 VEs at once until there are none left. If startvirt.pl is used on a system where "fastboot" was on, it will circumvent the fastboot for the VEs it starts – they will go through the complete quota check before starting – so it is not advisable when a system has crashed. When a system is booted cleanly and there is no need for vzquota checks, startvirt.pl is safe and advisable to run.

=== Make sure all containers are running ===

You can quickly get a feel for how many VEs are started by running:

<pre>[root@virt4 log]# vs
VEID 16066 exist mounted running
VEID 16067 exist mounted running
VEID 4102 exist mounted running
VEID 4112 exist mounted running
VEID 4116 exist mounted running
VEID 4122 exist mounted running
VEID 4123 exist mounted running
VEID 4124 exist mounted running
VEID 4132 exist mounted running
VEID 4148 exist mounted running
VEID 4151 exist mounted running
VEID 4155 exist mounted running
VEID 42 exist mounted running
VEID 432 exist mounted running
VEID 434 exist mounted running
VEID 442 exist mounted running
VEID 450 exist mounted running
VEID 452 exist mounted running
VEID 453 exist mounted running
VEID 454 exist mounted running
VEID 462 exist mounted running
VEID 463 exist mounted running
VEID 464 exist mounted running
VEID 465 exist mounted running
VEID 477 exist mounted running
VEID 484 exist mounted running
VEID 486 exist mounted running
VEID 490 exist mounted running</pre>

So, to see how many VEs have started:

<pre>[root@virt11 root]# vs | grep running | wc -l
39</pre>

And to see how many haven't:

<pre>[root@virt11 root]# vs | grep down | wc -l
0</pre>

And how many we should have running:

<pre>[root@virt11 root]# vs | wc -l
39</pre>

Another tool you can use to see which VEs have started, among other things, is [[#vzstat|vzstat]]. It gives you CPU, memory, and other stats on each VE and on the overall system. It's a good thing to watch as VEs are starting (note the VENum parameter; it tells you how many have started):

<pre>4:37pm, up 3 days, 5:31, 1 user, load average: 1.57, 1.68, 1.79
VENum 40, procs 1705: running 2, sleeping 1694, unint 0, zombie 9, stopped 0
CPU [ OK ]: VEs 57%, VE0 0%, user 8%, sys 7%, idle 85%, lat(ms) 412/2
Mem [ OK ]: total 6057MB, free 9MB/54MB (low/high), lat(ms) 0/0
Swap [ OK ]: tot 6142MB, free 4953MB, in 0.000MB/s, out 0.000MB/s
Net [ OK ]: tot: in 0.043MB/s 402pkt/s, out 0.382MB/s 4116pkt/s
Disks [ OK ]: in 0.002MB/s, out 0.000MB/s
  VEID ST   %VM    %KM     PROC       CPU     SOCK  FCNT MLAT IP
     1 OK  1.0/17 0.0/0.4  0/32/256  0.0/0.5  39/1256   0    9 69.55.227.152
    21 OK  1.3/39 0.1/0.2  0/46/410  0.2/2.8  23/1860   0    6 69.55.239.60
   133 OK  3.1/39 0.1/0.3  1/34/410  6.3/2.8  98/1860   0    0 69.55.227.147
   263 OK  2.3/39 0.1/0.2  0/56/410  0.3/2.8  34/1860   0    1 69.55.237.74
   456 OK   17/39 0.1/0.2  0/100/410 0.1/2.8  48/1860   0   11 69.55.236.65
   476 OK  0.6/39 0.0/0.2  0/33/410  0.1/2.8  96/1860   0   10 69.55.227.151
   524 OK  1.8/39 0.1/0.2  0/33/410  0.0/2.8  28/1860   0    0 69.55.227.153
   594 OK  3.1/39 0.1/0.2  0/45/410  0.0/2.8  87/1860   0    1 69.55.239.40
   670 OK  7.7/39 0.2/0.3  0/98/410  0.0/2.8  64/1860   0  216 69.55.225.136
   691 OK  2.0/39 0.1/0.2  0/31/410  0.0/0.7  25/1860   0    1 69.55.234.96
   744 OK  0.1/17 0.0/0.5  0/10/410  0.0/0.7   7/1860   0    6 69.55.224.253
   755 OK  1.1/39 0.0/0.2  0/27/410  0.0/2.8  33/1860   0    0 192.168.1.4
   835 OK  1.1/39 0.0/0.2  0/19/410  0.0/2.8   5/1860   0    0 69.55.227.134
   856 OK  0.3/39 0.0/0.2  0/13/410  0.0/2.8  16/1860   0    0 69.55.227.137
   936 OK  3.2/52 0.2/0.4  0/75/410  0.2/0.7  69/1910   0    8 69.55.224.181
  1020 OK  3.9/39 0.1/0.2  0/60/410  0.1/0.7  55/1860   0    8 69.55.227.52
  1027 OK  0.3/39 0.0/0.2  0/14/410  0.0/2.8  17/1860   0    0 69.55.227.83
  1029 OK  1.9/39 0.1/0.2  0/48/410  0.2/2.8  25/1860   0    5 69.55.227.85
  1032 OK   12/39 0.1/0.4  0/80/410  0.0/2.8  41/1860   0    8 69.55.227.90</pre>

When you are all done, make sure that all the VEs really did get started: run vs one more time. Note the time all VEs are back up, enter it into the crash log entry, and save it. Occasionally, a VE will not start automatically. The most common reason a VE does not come up normally is that it was at its disk limit before the crash, and it will not start because it is over the limit. To overcome this, set the disk space to the current usage level (the system reports this when the VE fails to start), start the VE, then set the disk space back to the prior level. Lastly, contact the customer to let them know they're out of disk (or allocate more disk if they're entitled to more).
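The raise-start-restore dance for an over-quota VE can be sketched as a function. Sizes are soft:hard limits in 1K blocks, which is what <tt>vzctl --diskspace</tt> expects; the VEID and numbers in the example call are hypothetical – the failed start prints the VE's real current usage.

```shell
# Sketch: temporarily raise a VE's disk limit to its current usage so it
# can start, then put the prior limits back. Arguments are hypothetical
# examples, not values from any particular incident.
fix_overquota_ve() {
    veid=$1 current=$2 prior=$3
    vzctl set "$veid" --diskspace "$current:$current" --save  # raise limit to current usage
    vzctl start "$veid"                                       # VE can now start
    vzctl set "$veid" --diskspace "$prior" --save             # restore prior soft:hard limits
}

# e.g.: fix_overquota_ve 4102 10485760 8388608:9437184
```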