VPS Management
== Recovering from a crash (FreeBSD) ==

=== Diagnose whether you have a crash ===

The most important thing is to get the machine and all jails back up as soon as possible. Note the time; you'll need it for the crash log entry you create (Mgmt. -> Reference -> CrashLog).

The first thing to do is head over to the [[Screen#Screen_Organization|serial console screen]] and see whether any kernel error messages have been output. Copy any messages you see (or just a sample of repeating messages) into the notes section of the crash log.

If there are no messages, the machine may just be really busy. Wait a bit (5-10 min) to see if it comes back. If it's still pinging, odds are it's very busy.

Note: if you see messages about swap space being exhausted, the server is out of memory. It may, however, recover briefly, long enough for you to run a <tt>jtop</tt> to see who has launched a ton of procs (the most likely cause) and then issue a quick <tt>jailkill</tt> to get it back under control.

If it doesn't come back, or the messages indicate a fatal error, you will need to proceed with a power cycle (ctrl+alt+del will not work).

=== Power cycle the server ===

If this machine is not a Dell 2950 with a [[DRAC/RMM#DRAC|DRAC card]] (i.e. if you can't ssh into the DRAC card as root, using the standard root pass, and issue <tt>racadm serveraction hardreset</tt>), then you will need someone at the data center to power the machine off, wait 30 sec, then turn it back on.

Make sure to re-attach to the console (<tt>tip jailX</tt>) immediately after power down.

=== (Re)attach to the console ===

Stay on the console the entire time during boot. As the BIOS posts, look out for the RAID card output: does everything look healthy? The output may be scrambled; look for "DEGRADED" or "FAILED".

Once the OS starts booting you will be disconnected (dropped back to the shell on the console server) a couple of times during boot. The reason you want to re-attach quickly is two-fold:

# If you don't reattach quickly, you won't get any console output.
# You want to be attached before the server ''potentially'' starts an (extensive) fsck. If you attach after the fsck begins, you'll have seen no indication that it started, and the server will appear frozen during startup: no output, no response.

IMPORTANT NOTE: on some older FreeBSD systems, there will be no output to the video (KVM) console as it boots up. The console output is redirected to the serial port, so if a jail machine crashes and you attach a KVM, the output during boot will not be shown on the screen. However, when boot is done, you will get a login prompt on the screen and will be able to log in as normal. The serial console redirect is configured in <tt>/boot/loader.conf</tt>, so comment it out there if you want to catch output on the KVM. Newer systems send most output to both locations.

=== Assess the health of the server ===

Once the server boots up fully, you should be able to ssh in. Look around: make sure all the mounts are there and reporting the correct size/usage (i.e. /mnt/data1 /mnt/data2 /mnt/data3; look in /etc/fstab to determine which mount points should be there), and check whether the RAID mirrors are healthy. See [[RAID_Cards#Common_CLI_commands_.28megacli.29|megacli]], [[#aaccheck|aaccheck]].

Before you start the jails, you need to run [[#preboot|preboot]]. This does some assurance checks to make sure things are prepped to start the jails. Any issues that come out of preboot need to be addressed before starting jails.

=== Start jails ===

[[#Starting_jails:_Quad.2FSafe_Files|More on starting jails]]

Customer jails (the VPSs) do not start up automatically at boot time. When a FreeBSD machine boots up, it boots up and does nothing else. To start jails, we put the commands to start each jail into one or more shell scripts and run them. Jail startup needs to be actively monitored, which is why we don't just run the scripts automatically.
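Before any jails are started, the mount check described under "Assess the health of the server" can be approximated in a few lines. The following is only a sketch, not one of our scripts: the function names are made up for illustration, and it assumes standard <tt>/etc/fstab</tt> fields (device, mount point, type, options) and <tt>mount</tt> output lines of the form <tt>device on /path (options)</tt>. It does not check size or usage.

```python
def expected_mounts(fstab_text):
    """Return the mount points in fstab text that should be mounted at boot."""
    mounts = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        if fstype == "swap" or mountpoint == "none":
            continue  # swap has no mount point
        if "noauto" in options.split(","):
            continue  # not mounted automatically at boot
        mounts.append(mountpoint)
    return mounts


def mounted_points(mount_output):
    """Parse `mount` output lines of the form 'DEV on /path (opts)'."""
    points = set()
    for line in mount_output.splitlines():
        if " on " not in line:
            continue
        rest = line.split(" on ", 1)[1]
        points.add(rest.split(" (", 1)[0].strip())
    return points


def missing_mounts(fstab_text, mount_output):
    """List fstab mount points that do not appear in the mount output."""
    mounted = mounted_points(mount_output)
    return [m for m in expected_mounts(fstab_text) if m not in mounted]
```

On a live host, one would pass in the contents of <tt>/etc/fstab</tt> and the captured output of <tt>mount</tt>; an empty result means every expected mount point is present.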
In order to start jails, we run the quad files: quad1, quad2, quad3 and quad4 (on new systems there is only quad1). If the machine was cleanly rebooted, which wouldn't be the case after a crash, you may run the safe files (safe1, safe2, safe3, safe4) in lieu of the quads.

Open up 4 logins to the server (use the windows in [[Screen#Screen_Organization|a9]]). What you run in each of the 4 windows depends on the system:

If there is a [[#startalljails|startalljails]] script (and only quad1), run that command in each of the 4 windows. It will parse through the quad1 file and start each jail. Follow the instructions [[#Problems_with_the_quad.2Fsafe_files|here]] for monitoring startup. Note that you can be a little more lenient with jails that take a while to start: startalljails will work around the slow jails and start the rest. As long as there aren't 4 jails "hung" during startup at the same time, the rest will get started eventually.

-or-

If there is no startalljails script, there will be multiple quad files. Start one quad in each of the 4 windows, i.e. quad1 in window 1, quad2 in window 2 and so on. DO NOT start any quad twice: it will crash the server. If you accidentally do this, just jailkill all the jails which are in the quad and run the quad again. Follow the instructions for monitoring quad startup.

Note the time the last jail boots; this is what you will enter in the crash log. Save the crash log.

=== Check to make sure all jails have started ===

There's a simple script which will make sure all jails have started, and enter the ipfw counter rules: [[#postboot|postboot]]

Run postboot, which will do a jailps on each jail it finds (excluding commented-out jails) in the quad file(s). We're looking for 2 things:

# systems spawning out of control or with too many procs
# jails which haven't started

On 7.x and newer systems, postboot prints out the problems (which jails haven't started) when it finishes.
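The "which jails haven't started" part of that check can be sketched independently of postboot. This is an illustration only, not the postboot script: the function names are invented, the list of expected hostnames has to be supplied by the caller (the quad file format is not parsed here), and it assumes the default <tt>jls</tt> column layout of JID, IP Address, Hostname, Path, with every running jail having an IP address.

```python
def running_hostnames(jls_output):
    """Extract the Hostname column from default `jls` output."""
    names = set()
    for line in jls_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric JID; this skips the header line.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        names.add(fields[2])
    return names


def jails_not_running(expected, jls_output):
    """Return the expected hostnames that jls does not list, in input order."""
    running = running_hostnames(jls_output)
    return [host for host in expected if host not in running]
```

A hostname reported here is not necessarily a dead jail; it may also be running under a different hostname, which is why each one needs a manual look.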
On older systems you will need to watch closely to see if/when there's a problem, namely: <tt>[hostname] doesnt exist on this server</tt>

When you get this message, it means one of 2 things:

# The jail really didn't start. When a jail doesn't start, it usually boils down to a problem in the quad file. Perhaps the path name is wrong (data1 vs data2) or the name of the vn/mdfile is wrong. Once this is corrected, you will need to run the commands from the quad file manually, or you may use <tt>startjail <hostname></tt>.
# The customer has changed their hostname (and not told us), so their jail ''is'' running, just under a different hostname. On systems with jls, this is easy to rectify. First, get the customer info: <tt>g <hostname></tt>. Then look for the customer in jls: <tt>jls | grep <col0XXXX></tt>. From there you will see their new hostname. Update that hostname in the quad file (don't forget to edit it on the <tt>## begin ##</tt> and <tt>## end ##</tt> lines) and in mgmt. On older systems without jls this is harder: you will need to look further to find their hostname, perhaps in their /etc/rc.conf.

Once all jails are started, do some spot checks: try to ssh or browse to some customers, just to make sure things are really OK.
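For the renamed-hostname case above, the <tt>jls | grep <col0XXXX></tt> step can be sketched as a small function that returns the hostname a customer's jail is currently running under. This is a sketch under assumptions: the function name is invented, it expects the default <tt>jls</tt> columns (JID, IP Address, Hostname, Path), and, like the grep, it matches the customer ID anywhere on the line.

```python
def hostnames_for_customer(jls_output, customer_id):
    """Return hostnames of jls rows that mention customer_id anywhere on the line."""
    matches = []
    for line in jls_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric JID; this skips the header line.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        if customer_id in line:
            matches.append(fields[2])
    return matches
```

If the returned hostname differs from the one in the quad file, the quad file and mgmt are what need updating; an empty result suggests the jail really is not running.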