Latest revision as of 11:39, 10 June 2020

Daily Tasks[edit]

check load graphs[edit]

Click on the Load link in mgmt

This screen shows you load levels on our servers and network traffic for critical machines (firewalls, backup servers).

If you see load high or increasing

FreeBSD: run jtop (or jt > 7.x) and see if there are any runaway processes. Here are some examples of entries in top that are definitely runaway processes:

79481 root      64   0  2256K  1056K CPU1   1  58:16 87.40% 87.40% nano
50650   1000    64   0  1852K  1112K RUN    0 207.9H 84.08% 84.08% screen
14829 www        2   0 39100K 31736K accept 0  104:24  46.54%  6.54% httpd
42065 root      61   0  1300K   844K RUN    1  47.8H 91.36% 91.36% ee
1328 www       56   0 18440K 10796K CPU1   0  64.4H 97.71% 97.71% httpd
26251 user      57   0  6124K  1160K CPU1   1  82.9H 98.44% 98.44% screen
89874 root      60   0  1352K   892K RUN    1  33.8H 65.82% 65.82% dialog
38656   1000    64   0  3088K  2136K CPU0   0 806:13 97.95% 97.95% StutBot
27630 root      64   0  1396K   972K RUN    1  76.8H 86.47% 86.47% ee

Linux: run vwe to see which VPS’s have high loads. From there run vp <veid> and/or vt <veid> to see what's going on in that system. vzstat will also give you a nice picture of whats going on, systems with high numbers in the mlat column are likely culprits.

examples of out of control procs:

12183 nobody    16   0  4916 1348  1340 R    45.5  0.0  4249m httpd
29266 #502      16   0  1852  796   792 R    22.5  0.0  1104m vim
23860 #41       16   0  5472 5472  2076 R    98.9  0.2  31:41 python
19227 bin       19   0  1688  716   652 R    99.9  0.0 321:08 wtrs_ui
 7762 apache    16   0   268  236   224 R    85.7  0.0  1010m ptrace
 4624 #501      20   0  4304 2400  2044 R    53.6  0.1 284:32 YoSucker
20451 #506      20   0  1876  820   816 R    17.2  0.0 169:35 vim
 8834 #514      20   0   900  724   672 R    77.6  0.0 382:30 neostats
31815 apache    14   0  3176 3176  1696 R    74.4  0.1   6:15 counter

Just kill -9 them and be done with it.

Also, anytime you see `kmod` or `ptrace` - kill those immediaely no matter how much they are using - they are attempts to exploit the linux ptrace bug. They won't work, but they suck a lot of CPU...

Also, any other processes that are at 90-100% cpu usage and have been running for any long period of time should be killed except for mysqld processes on FreeBSD. See above.

However, there is an exception:

if it is a mysqld, we don't want to kill their database. What you want to do is jpid <pid> to see who owns it, and then email them the paste containing the instructions for the nanny. Or you can simply do a kill -1 PID on the process to restart it.

Load averages jump at night[edit]

The load averages on the FreeBSD systems may jump up at night between 1 and 4 am - this is because the backups are running - if this is what is causing a jump in load, you will see processes like `rsync` in top eating a lot of CPU time.

check backups[edit]

mgmt -> Motnroing -> Backups and make sure every machine was backed up the previous nite. Also look at df on backup1 and backup2 to make sure no disk is approaching full, though bb should warn us in advance. Please note - errors encountered when a backup script on any of the particular systems run will gnerate an email to support@johncompanies.com so you can know immediately the day after if the directory to be backed up has been moved or no longer exists. A paste exists for this to notify the customer of a non-existant file/dir.

check bb for warnings[edit]

mgmt -> BigBrother

Some events don't generate pages (on purpose). You will only see them by going to the bb main page.

check jail5 for crashed VPSs[edit]

On jail5

 notrunning

To restart a VPS

 vm restart col0XXXX

Check NetHere[edit]

Check the NetHere servers. To get into the servers, login to admin-1.nethere.net and su - to root.

Mail systems[edit]

Check for possible SPAMMERs.

Incoming[edit]

Check mta-1 and mta-2 count of customer logins for possible customer SPAM compromises.

 login_count /logs/maillog | tail -30

Outgoing[edit]

Check outgoing queues on relay-1 and relay-2

 mail_count | tail -30

To clean up outgoing queues of unwanted SPAM on relay-1 and relay-2.

 mail_cleanup [ <sender's domain/username/message id> ... ]

To just remove emails from some senders.

 rmmails <sender's domain/username/message id> ...

Nagios[edit]

Check for other problems on NetHere servers

 https://nagios.nethere.net

Cacti[edit]

Check bandwidth usage on servers

 https://cacti.nethere.net

Monthly Tasks[edit]

rotate pine sent mail (1st of month)[edit]

On the 1st of the month, before any emails are sent out, quit out of pine, then log back in. Send mail from last month will be archived. If you mess up and do it on the 3rd (for example), you can go into the previous month's saved email and save emails from the current month into the sent-mail (current month) mailbox.

b/w caps[edit]

On the 1st: remove any bwcaps put into the firewall (only really applies if a bwcap was added cause someone went over on b/w):

ipfw list|grep pipe
ipfw del [each rule listed]

NOTE: this cronjob on newgateway will do some of that for you, provided you used one of the following pipe #s:

0 0 1 * * /sbin/ipfw del 3  4 5 17331

We really don’t do this anymore since we have centralized traffic accounting with netflow, but for posterity:

Make sure all machines reset counters to 0 after midnight on the 1st Make sure they dumped a counter

On each jail run:

trafficgather.pl

And on each virt:

linuxtrafficgather.pl

Monthly RAID checks[edit]

Every month we check the health of and verfy the parity on all our RAID-based systems. To facilitate this, we've created a simple script to start the process:

sh /root/verify.sh

Adaptec controllers[edit]

Here's some sample output:

mail /usr/local/www/scripts# sh /root/verify.sh
---------------------------------------------------------------------------------------------

Adaptec SCSI RAID Controller Command Line Interface
Copyright 1998-2002 Adaptec, Inc. All rights reserved
---------------------------------------------------------------------------------------------


CLI > open aac0
Executing: open "aac0"

AAC0> container list /f
Executing: container list /full=TRUE
Num          Total  Oth Chunk          Scsi   Partition
Creation        System
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size   State   RO Lk Task    Done%  Ent
Date   Time      Files
----- ------ ------ --- ------ ------- ------ ------------- ------- -- -- ------- ------ ---
------ -------- ------
 0    Mirror 33.9GB            Open    0:01:0 64.0KB:33.9GB Normal                        0
071002 05:39:32
 /dev/aacd0           mirror0          0:00:0 64.0KB:33.9GB Normal                        1
071002 05:39:32

 1    Mirror 33.9GB            Open    0:02:0 64.0KB:33.9GB Normal                        0
071002 05:39:50
 /dev/aacd1           mirror1          0:03:0 64.0KB:33.9GB Normal                        1
071002 05:39:50


AAC0> disk list /f
Executing: disk list /full=TRUE

B:ID:L  Device Type     Removable media  Vendor-ID Product-ID        Rev   Blocks    Bytes/Bl
ock Usage            Shared Rate
------  --------------  ---------------  --------- ----------------  ----- --------- --------
--- ---------------- ------ ----
0:00:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160
0:01:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160
0:02:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160
0:03:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160

AAC0> disk show smart
Executing: disk show smart

        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:00:0     Y            6             Y           N             0
0:01:0     Y            6             Y           N             0
0:02:0     Y            6             Y           N             0
0:03:0     Y            6             Y           N             0
0:06:0     N

AAC0> task list
Executing: task list

Controller Tasks

TaskId Function  Done%  Container State Specific1 Specific2
------ -------- ------- --------- ----- --------- ---------

No tasks currently running on controller

AAC0> dia sh hi
Executing: diagnostic show history
No switches specified, defaulting to "/current".



 *** HISTORY BUFFER FROM CURRENT CONTROLLER RUN ***

[00]: GetDiskLogEntry: container - 1, entry return 0
[01]: Container 1 started SCRUB task
[02]: Starting Mirror:1 scrub
[03]: Master disk: 2, start sector: 128, sector count = 71286784
[04]: Slave  disk: 3, start sector: 128, sector count = 71286784
[05]: UpdateDiskLogIndex - Set   - container 0, index 1
[06]: GetDiskLogEntry: container - 0, entry return 1
[07]: Container 0 started SCRUB task
[08]: Starting Mirror:0 scrub
[09]: Master disk: 1, start sector: 128, sector count = 71286784
[10]: Slave  disk: 0, start sector: 128, sector count = 71286784
[11]: Mirror Scrub Container:1   ErrorsFound:0
[12]: Clear disk log: sector - 80, driveno 2
[13]: Clear disk log: sector - 80, driveno 3
[14]: Container 1 completed SCRUB task:
[15]: Mirror Scrub Container:0   ErrorsFound:0
[16]: Clear disk log: sector - 81, driveno 1
[17]: Clear disk log: sector - 81, driveno 0
[18]: Container 0 completed SCRUB task:
[19]: UpdateDiskLogIndex - Set   - container 0, index 0
[20]: GetDiskLogEntry: container - 0, entry return 0
[21]: Container 0 started SCRUB task
[22]: Starting Mirror:0 scrub
[23]: Master disk: 1, start sector: 128, sector count = 71286784
[24]: Slave  disk: 0, start sector: 128, sector count = 71286784
[25]: UpdateDiskLogIndex - Set   - container 1, index 1
[26]: GetDiskLogEntry: container - 1, entry return 1
[27]: Container 1 started SCRUB task
[28]: Starting Mirror:1 scrub
[29]: Master disk: 2, start sector: 128, sector count = 71286784
[30]: Slave  disk: 3, start sector: 128, sector count = 71286784
[31]: Mirror Scrub Container:1   ErrorsFound:0
[32]: Clear disk log: sector - 81, driveno 2
[33]: Clear disk log: sector - 81, driveno 3
[34]: Container 1 completed SCRUB task:
[35]: Mirror Scrub Container:0   ErrorsFound:0
[36]: Clear disk log: sector - 80, driveno 1
[37]: Clear disk log: sector - 80, driveno 0
[38]: Container 0 completed SCRUB task:
[39]: UpdateDiskLogIndex - Set   - container 0, index 0
[40]: GetDiskLogEntry: container - 0, entry return 0
[41]: Container 0 started SCRUB task
[42]: Starting Mirror:0 scrub
[43]: Master disk: 1, start sector: 128, sector count = 71286784
[44]: Slave  disk: 0, start sector: 128, sector count = 71286784
[45]: UpdateDiskLogIndex - Set   - container 1, index 1
[46]: GetDiskLogEntry: container - 1, entry return 1
[47]: Container 1 started SCRUB task
[48]: Starting Mirror:1 scrub
[49]: Master disk: 2, start sector: 128, sector count = 71286784
[50]: Slave  disk: 3, start sector: 128, sector count = 71286784
[51]: Mirror Scrub Container:1   ErrorsFound:0
[52]: Clear disk log: sector - 81, driveno 2
[53]: Clear disk log: sector - 81, driveno 3
[54]: Container 1 completed SCRUB task:
[55]: Mirror Scrub Container:0   ErrorsFound:0
[56]: Clear disk log: sector - 80, driveno 1
[57]: Clear disk log: sector - 80, driveno 0
[58]: Container 0 completed SCRUB task:
[59]: UpdateDiskLogIndex - Set   - container 0, index 0
[60]: GetDiskLogEntry: container - 0, entry return 0
[61]: Container 0 started SCRUB task
[62]: Starting Mirror:0 scrub
[63]: Master disk: 1, start sector: 128, sector count = 71286784
[64]: Slave  disk: 0, start sector: 128, sector count = 71286784
[65]: UpdateDiskLogIndex - Set   - container 1, index 1
[66]: GetDiskLogEntry: container - 1, entry return 1
[67]: Container 1 started SCRUB task
[68]: Starting Mirror:1 scrub
[69]: Master disk: 2, start sector: 128, sector count = 71286784
[70]: Slave  disk: 3, start sector: 128, sector count = 71286784
[71]: Mirror Scrub Container:1   ErrorsFound:0
[72]: Clear disk log: sector - 81, driveno 2
[73]: Clear disk log: sector - 81, driveno 3
[74]: Container 1 completed SCRUB task:
[75]: Mirror Scrub Container:0   ErrorsFound:0
[76]: Clear disk log: sector - 80, driveno 1
[77]: Clear disk log: sector - 80, driveno 0
[78]: Container 0 completed SCRUB task:
[79]: UpdateDiskLogIndex - Set   - container 0, index 0
[80]: GetDiskLogEntry: container - 0, entry return 0
[81]: Container 0 started SCRUB task
[82]: Starting Mirror:0 scrub
[83]: Master disk: 1, start sector: 128, sector count = 71286784
[84]: Slave  disk: 0, start sector: 128, sector count = 71286784
[85]: UpdateDiskLogIndex - Set   - container 1, index 1
[86]: GetDiskLogEntry: container - 1, entry return 1
[87]: Container 1 started SCRUB task
[88]: Starting Mirror:1 scrub
[89]: Master disk: 2, start sector: 128, sector count = 71286784
[90]: Slave  disk: 3, start sector: 128, sector count = 71286784
[91]: Mirror Scrub Container:1   ErrorsFound:0
[92]: Clear disk log: sector - 81, driveno 2
[93]: Clear disk log: sector - 81, driveno 3
[94]: Container 1 completed SCRUB task:
[95]: Mirror Scrub Container:0   ErrorsFound:0
[96]: Clear disk log: sector - 80, driveno 1
[97]: Clear disk log: sector - 80, driveno 0
[98]: Container 0 completed SCRUB task:
[99]:

========================
History Output Complete.

AAC0>
AAC0> exit
Executing: exit

press enter when ready to run verify                                                 <INS>
---------------------------------------------------------------------------------------------

Adaptec SCSI RAID Controller Command Line Interface
Copyright 1998-2002 Adaptec, Inc. All rights reserved
---------------------------------------------------------------------------------------------


CLI > open aac0
Executing: open "aac0"

AAC0> contai scr 0
Executing: container scrub 0

AAC0> contai scr 1
Executing: container scrub 1

AAC0> exit
Executing: exit

when done run:                                                                       

aaccli
open aac0
dia sh hi
c


Nov  1 10:32:46 mail /kernel: aac0: **Monitor** Container 0 started SCRUB task
Nov  1 10:32:47 mail /kernel: aac0: **Monitor** Container 1 started SCRUB task

Here's an analysis of what we're seeing and what we're looking for:

AAC0> container list /f
Executing: container list /full=TRUE
Num          Total  Oth Chunk          Scsi   Partition
Creation        System
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size   State   RO Lk Task    Done%  Ent
Date   Time      Files
----- ------ ------ --- ------ ------- ------ ------------- ------- -- -- ------- ------ ---
------ -------- ------
 0    Mirror 33.9GB            Open    0:01:0 64.0KB:33.9GB Normal                        0
071002 05:39:32
 /dev/aacd0           mirror0          0:00:0 64.0KB:33.9GB Normal                        1
071002 05:39:32

 1    Mirror 33.9GB            Open    0:02:0 64.0KB:33.9GB Normal                        0
071002 05:39:50
 /dev/aacd1           mirror1          0:03:0 64.0KB:33.9GB Normal                        1
071002 05:39:50

This is showing you the health of the arrays. You're looking for Normal under the State column, and the absence of a ! in the sector size - sometimes, you'll see this:

64.0KB!33.9GB

That indicates a problem.

AAC0> disk show smart
Executing: disk show smart

        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:00:0     Y            6             Y           N             0
0:01:0     Y            6             Y           N             0
0:02:0     Y            6             Y           N             0
0:03:0     Y            6             Y           N             0
0:06:0     N

This shows you a SMART report output. Looking for values in the Error Count column.

AAC0> task list
Executing: task list

Controller Tasks

TaskId Function  Done%  Container State Specific1 Specific2
------ -------- ------- --------- ----- --------- ---------

No tasks currently running on controller

Look for absence of tasks running- a bad thing would be to see a rebuild or verify running when you didn't initiate it.

With the history output, you're looking for any anomalies or events since the last time a verify was run. If you see a drive with lots of problems, you may want to take backups before allowing the verify to run since it could replicate errors onto the good drive.

After you see the history output, it will prompt you to press enter to run the verify. If you're happy with all the output you're seeing- mirror is healthy, history looks good, it's safe to proceed. Otherwise ^C to exit. After hitting enter it will start the verify and start to tail the messages log file (so you can easily see when the verify is complete). Here's what that'll look like:

Nov  1 14:38:08 mail /kernel: aac0: **Monitor** Container 1 completed SCRUB task:
Nov  1 14:46:45 mail /kernel: aac0: **Monitor** Container 0 completed SCRUB task:

So, putting it all together, after hitting enter to start the verify, you'll see:

---------------------------------------------------------------------------------------------

Adaptec SCSI RAID Controller Command Line Interface
Copyright 1998-2002 Adaptec, Inc. All rights reserved
---------------------------------------------------------------------------------------------


CLI > open aac0
Executing: open "aac0"

AAC0> contai scr 0
Executing: container scrub 0

AAC0> contai scr 1
Executing: container scrub 1

AAC0> exit
Executing: exit

when done run:                                                                       

aaccli
open aac0
dia sh hi
c


Nov  1 10:32:46 mail /kernel: aac0: **Monitor** Container 0 started SCRUB task
Nov  1 10:32:47 mail /kernel: aac0: **Monitor** Container 1 started SCRUB task

When the scrub(s) (verify) are complete - if the server has multiple logical drives, it will run both in parallel - you should exit the tail of the log file (^C) and run:

aaccli
open aac0
dia sh hi
c

Which will show you the diagnostic history, you're looking for the results of the most recent scrub:

[100]: Mirror Scrub Container:1   ErrorsFound:0
[101]: Clear disk log: sector - 81, driveno 2
[102]: Clear disk log: sector - 81, driveno 3
[103]: Container 1 completed SCRUB task:
[104]: Mirror Scrub Container:0   ErrorsFound:0
[105]: Clear disk log: sector - 80, driveno 1
[106]: Clear disk log: sector - 80, driveno 0
[107]: Container 0 completed SCRUB task:

^C to exit the RAID CLI.

If you see:

[104]: Mirror Scrub Container:0   ErrorsFound:5

You'll want to rerun the verify on that drive till it shows 0, or perhaps replace the drive- you should be able to see from the output which drive had the problem.

Depending on the size and how busy the drive is, the verify can take anywhere from an hour to the better part of a day.

You will notice that the diagnostic history is not shown on our modern adaptec cards (i.e. any adaptec card not in a Dell 2450). The reason for this is the history is never cleared, so there's simply too much data to show and it just crashes the CLI. So, don't bother trying to see it...which does make it hard to see if there are problems going on, so you just need to watch the scrub to see it goes to 100%. You will also notice that on some servers there's no tail of messages. Again, this is cause no data is shown there about the completion of the scrub. The thing to do here is to go into the CLI and continue to show tasks to monitor scrub progress.

See Adaptec RAID CLI Reference for more details on how to use the CLI

DELL (LSI-based) SAS controllers[edit]

Here's what the output looks like when running verify.sh on a LSI-based card:

jail2 /mnt/data2# sh /root/verify.sh

Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 140014MB [0x11177328 Sectors]
Non Coerced Size: 139502MB [0x11077328 Sectors]
Coerced Size: 139392MB [0x11040000 Sectors]
Firmware state: Online
SAS Address(0): 0x500000e018396142
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: FUJITSU MAX3147RC       D207DQ03P7A0DESN
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

Enclosure Device ID: 32
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 140014MB [0x11177328 Sectors]
Non Coerced Size: 139502MB [0x11077328 Sectors]
Coerced Size: 139392MB [0x11040000 Sectors]
Firmware state: Online
SAS Address(0): 0x500000e018395db2
SAS Address(1): 0x0
Connected Port Number: 1(path0)
Inquiry Data: FUJITSU MAX3147RC       D207DQ03P7A0DERV
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

Enclosure Device ID: 32
Slot Number: 2
Device Id: 2
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 286102MB [0x22ecb25c Sectors]
Non Coerced Size: 285590MB [0x22dcb25c Sectors]
Coerced Size: 285568MB [0x22dc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c50006eece89
SAS Address(1): 0x0
Connected Port Number: 2(path0)
Inquiry Data: SEAGATE ST3300555SS     T2113LM4BFBZ
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

Enclosure Device ID: 32
Slot Number: 3
Device Id: 3
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 286102MB [0x22ecb25c Sectors]
Non Coerced Size: 285590MB [0x22dcb25c Sectors]
Coerced Size: 285568MB [0x22dc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c50006eee035
SAS Address(1): 0x0
Connected Port Number: 3(path0)
Inquiry Data: SEAGATE ST3300555SS     T2113LM4BGF7
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

Enclosure Device ID: 32
Slot Number: 4
Device Id: 4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 286102MB [0x22ecb25c Sectors]
Non Coerced Size: 285590MB [0x22dcb25c Sectors]
Coerced Size: 285568MB [0x22dc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c50004bd7ea5
SAS Address(1): 0x0
Connected Port Number: 4(path0)
Inquiry Data: SEAGATE ST3300656SS     HS093QP0G8SW
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

Enclosure Device ID: 32
Slot Number: 5
Device Id: 5
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 286102MB [0x22ecb25c Sectors]
Non Coerced Size: 285590MB [0x22dcb25c Sectors]
Coerced Size: 285568MB [0x22dc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x500000e01f1c4112
SAS Address(1): 0x0
Connected Port Number: 5(path0)
Inquiry Data: FUJITSU MBA3300RC       D306BJ15P9201W06
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown


Exit Code: 0x00


Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:139392MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default

Exit Code: 0x00


Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:MIRROR1
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:285568MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default

Exit Code: 0x00


Adapter 0 -- Virtual Drive Information:
Virtual Disk: 2 (Target Id: 2)
Name:MIRROR2
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:285568MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default

Exit Code: 0x00
Battery FRU     : N/A
Battery Warning                  : Enabled
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0
BBU             : Present
BBU                             : Yes
Cache When BBU Bad               : Disabled
press enter when ready to run verify

Before pressing enter, here's what we're looking for:

Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 140014MB [0x11177328 Sectors]
Non Coerced Size: 139502MB [0x11077328 Sectors]
Coerced Size: 139392MB [0x11040000 Sectors]
Firmware state: Online
SAS Address(0): 0x500000e018396142
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: FUJITSU MAX3147RC       D207DQ03P7A0DESN
Foreign State: None
Media Type: Hard Disk Device
Device Speed: Unknown
Link Speed: Unknown

This is the output shown for each physical drive in the system. We're looking to confirm it's Firmware state is Online, and Media Error Count, Other Error Count, and Predictive Failure Count are all zero (or near zero).

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:MIRROR1
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:285568MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default

This is the output for each logical drive. We're looking for State Optimal. Also confirm Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Exit Code: 0x00
Battery FRU     : N/A
Battery Warning                  : Enabled
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0
BBU             : Present
BBU                             : Yes
Cache When BBU Bad               : Disabled

Confirm that the battery is present and error-free.

If all that checks out, you're ready to proceed with the verify. After pressing enter, the verify is started and here's what you see:

Start Check Consistency on Virtual Drive 0 (target id: 0) Success.

Exit Code: 0x00

Start Check Consistency on Virtual Drive 1 (target id: 1) Success.

Exit Code: 0x00

Start Check Consistency on Virtual Drive 2 (target id: 2) Success.

Exit Code: 0x00

  Check Consistency

 Progress of Virtual Drives...

  Virtual Drive #              Percent Complete                       Time Elps
          0         °°°°°°°°°°°°°°°°°°°°°°°00 %°°°°°°°°°°°°°°°°°°°°°°° 00:00:03
          1         °°°°°°°°°°°°°°°°°°°°°°°00 %°°°°°°°°°°°°°°°°°°°°°°° 00:00:02
          2         °°°°°°°°°°°°°°°°°°°°°°°00 %°°°°°°°°°°°°°°°°°°°°°°° 00:00:01

    Press <ESC> key to quit...

The progress for each drive is displayed until all drives have completed the verify. We just want to make sure that each drive goes to completion. No followup is needed...though there probably is a log or history where we can get more info.

You will notice that jail7 does not run a verify- that's on purpose. The last time we tried this it crashed the system. So, this must be run from the BIOS (take the system offline for a couple hours).

See LSI RAID CLI Reference for more details on how to use the CLI

LSI-based controllers (megaraid)[edit]

There is a CLI for this however it's easier to do this with a curses GUI app: megaraid

Currently only on these servers: virt15, virt16, and firewall2

To run:

# cd /usr/local/sbin/; megamgr

Main menu:

²ÚÄÄManagement MenuÄÄ¿²
²³ Configure         ³²
²³ Initialize        ³²
²³ Objects           ³²
²³ Rebuild           ³²
²³ Check Consistency ³²
²³ Advanced Menu     ³²
²ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²

Before you check consistency, make sure the arrays are healthy.

Objects -> Physical Drive

Then look to make sure they're all ONLIN

²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²ÚÄÄÄÄÄÄÄÄÄÄÄÄObjects - PHYSICAL DRIVE SELECTION MENUÄÄÄÄÄÄÄÄÄÄÄÄÄ¿²²²²²²
²²²²²²²²³                                                                ³²²²²²²
²²²²²²²²³                         Channel-1                              ³²²²²²²
²ÚÄÄMana³                     ID ÉÍÍÍÍÍÍÍÍÍÍÍÍÍÍ»x                       ³²²²²²²
²³ Confi³                       0º* ONLIN A01-01º                        ³²²²²²²
²³ Initi³                        ÌÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¹                        ³²²²²²²
²³ Objec³                       1º* ONLIN A01-02º                        ³²²²²²²
²³ Rebui³                        ÌÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¹                        ³²²²²²²
²³ Check³                       2º* ONLIN A02-01º                        ³²²²²²²
²³ Advan³                        ÌÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¹                        ³²²²²²²
²ÀÄÄÄÄÄÄ³                       3º* ONLIN A02-02º                        ³²²²²²²
²²²²²²²²³                        ÌÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¹                        ³²²²²²²
²²²²²²²²³                       4º* ONLIN A03-01º                        ³²²²²²²
²²²²²²²²³                        ÌÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¹                        ³²²²²²²
²²²²²²²²³                       5º* ONLIN A03-02ºþ                       ³²²²²²²
²²²²²²²²³                        ÌÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¹                        ³²²²²²²
²²²²²²²²³                       6º*             º                        ³²²²²²²
²²²²²²²²³                        ÈÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¼x                       ³²²²²²²
²²²²²²²²ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²²²²²²
²²²²²²²²ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿²²²²²²²²
²²²²²²²²³Ch-1 ID-5  DISK      140013MB  SEAGATE  ST3146707LC      0003 ³²²²²²²²²
²²²²²²²²ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²²²²²²²²

Once that's done, hit escape once then the back arror to move back to the Objects menu. So you select Objects -> Logical Drive -> Logical Drive 1 -> Check Consistency -> YES


²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²ÚÄLogical Drives(02)Ä¿²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²³ Logical Drive 1    ³²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²ÚÄÄÄÄObjectsÄÄÄ³ Logical Drive 2    ³²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²ÚÄÄManagemen³ Adapter      ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²³ Configure ³ Logical Drive  ³²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²³ Initialize³ Physical Drive ³²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²³ Objects   ³ Channel        ³²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²³ Rebuild   ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²³ Check Consistency ³²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²³ Advanced Menu     ³²²²²²²²ÚÄÄÄÄLogical Drive 1ÄÄÄÄÄ¿²²²²²²²²²²²²²²²²²²²²²²²²²
²ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²²²²²²²³ Initialize    ÚÄCheck Consistency-1  ?Ä¿²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²³ Check Consiste³   YES                  ³²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²³ View/Update Pa³   NO                   ³²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²³Select YES Or NO³²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²
²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²

Then watch the progress. When done, escape back to Logical Drive then repeat for Logical Drive 2. If you ^C or accidentally escape out, you can come back in running the same commands and watch the progress again (it won't restart).

You can exit megamgr by escaping out or ^C

3ware[edit]

We are using 3ware controllers on backup1 & backup2. Running the verify script will give you different output based on the type of controller:

backup2 /d2# sh /root/verify.sh
Controller: c0
-------------
Driver:   1.50.01.002
Model:    7500-8
FW:       FE7X 1.05.00.068
BIOS:     BE7X 1.08.00.048
Monitor:  ME7X 1.01.00.040
Serial #: F11605A3180172
PCB:      Rev3
PCHIP:    1.30-33
ACHIP:    3.20


# of units: 3
        Unit 0: JBOD 186.31 GB ( 390721968 blocks): OK
        Unit 1: RAID 5 465.77 GB ( 976790016 blocks): DEGRADED
        Unit 5: RAID 5 698.65 GB ( 1465185024 blocks): DEGRADED

# of ports: 8
        Port 0: WDC WD2000JB-00KFA0 WD-WCAMT1451690 186.31 GB (390721968 blocks): OK(unit 0)
        Port 1: WDC WD2500JB-00GVC0 WD-WCAL78219488 232.88 GB (488397168 blocks): OK(unit 1)
        Port 2: WDC WD2000  0.00 MB (0 blocks): OK(NO UNIT)
        Port 3: WDC WD2500JB-00GVC0 WD-WMAL73882417 232.88 GB (488397168 blocks): OK(unit 1)
        Port 4: WDC WD2000  0.00 MB (0 blocks): OK(NO UNIT)
        Port 5: WDC WD2500JB-00GVA0 WD-WMAL71338097 232.88 GB (488397168 blocks): OK(unit 5)
        Port 6: WDC WD2500JB-32EVA0 WD-WMAEH1301595 232.88 GB (488397168 blocks): OK(unit 5)
        Port 7: WDC WD2500JB-00GVC0 WD-WCAL78165566 232.88 GB (488397168 blocks): OK(unit 5)
Controller: c1
-------------
Driver:   1.50.01.002
Model:    7500-8
FW:       FE7X 1.05.00.068
BIOS:     BE7X 1.08.00.048
Monitor:  ME7X 1.01.00.040
Serial #: F11605A3180167
PCB:      Rev3
PCHIP:    1.30-33
ACHIP:    3.20


# of units: 2
        Unit 0: RAID 5 698.65 GB ( 1465185024 blocks): OK
        Unit 4: RAID 5 698.65 GB ( 1465185024 blocks): OK

# of ports: 8
        Port 0: WDC WD2500JB-00GVA0 WD-WMAL71301258 232.88 GB (488397168 blocks): OK(unit 0)
        Port 1: WDC WD2500JB-00GVA0 WD-WMAL71322705 232.88 GB (488397168 blocks): OK(unit 0)
        Port 2: WDC WD2500JB-00GVA0 WD-WMAL71945050 232.88 GB (488397168 blocks): OK(unit 0)
        Port 3: WDC WD2500JB-00GVA0 WD-WMAL71316201 232.88 GB (488397168 blocks): OK(unit 0)
        Port 4: WDC WD2500JB-00GVC0 WD-WCAL78323749 232.88 GB (488397168 blocks): OK(unit 4)
        Port 5: WDC WD3200AAJB-00J3A0 WD-WCAV2V689068 298.09 GB (625142448 blocks): OK(unit 4)
        Port 6: WDC WD2500JB-00GVC0 WD-WCAL78234420 232.88 GB (488397168 blocks): OK(unit 4)
        Port 7: WDC WD2500JB-00GVC0 WD-WCAL78592213 232.88 GB (488397168 blocks): OK(unit 4)
backup2 /d2#

On backup2 look for all ok, no verify.

[root@backup3 ~]# sh /root/verify.sh
/c2 Driver Version = 1.26.02.002
/c2 Model = 8006-2LP
/c2 Available Memory = 512KB
/c2 Firmware Version = FE8S 1.05.00.068
/c2 Bios Version = BE7X 1.08.00.048
/c2 Boot Loader Version = ME7X 1.01.00.040
/c2 Serial Number = L018501C6481395
/c2 PCB Version = Rev5
/c2 PCHIP Version = 1.30-66
/c2 ACHIP Version = 3.20
/c2 Total Optimal Units = 1
/c2 Not Optimal Units = 0

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u1    RAID-1    OK             -       -       -       931.512   ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u1     931.51 GB   1953525168    WD-WMAW31148820
p1     OK               u1     931.51 GB   1953525168    WD-WCATR0277515


Ctl  Date                        Severity  Alarm Message
------------------------------------------------------------------------------

Sending start verify message to /c2/u1 ... Done.


when done run:
tw_cli /c2 show alarms

[root@backup3 ~]#

Automatically starts the verify, just run tw_cli /c2 show alarms as instructed to see the results of the verify.

[root@backup1 /data/deprecated]# sh /root/verify.sh
/c0 Driver Version = 2.26.02.010
/c0 Model = 9650SE-8LPML
/c0 Available Memory = 224MB
/c0 Firmware Version = FE9X 4.06.00.004
/c0 Bios Version = BE9X 4.05.00.015
/c0 Boot Loader Version = BL9X 3.08.00.001
/c0 Serial Number = L326025A8270177
/c0 PCB Version = Rev 032
/c0 PCHIP Version = 2.00
/c0 ACHIP Version = 1.90
/c0 Number of Ports = 8
/c0 Number of Drives = 6
/c0 Number of Units = 1
/c0 Total Optimal Units = 1
/c0 Not Optimal Units = 0
/c0 JBOD Export Policy = off
/c0 Disk Spinup Policy = 1
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Controller Bus Type = PCIe
/c0 Controller Bus Width = 1 lane
/c0 Controller Bus Speed = 2.5 Gbps/lane

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     4656.56   ON     ON

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     931.51 GB   1953525168    9QJ1Y017
p1     DEVICE-ERROR     u0     931.51 GB   1953525168    9QJ1ZN07
p2     OK               u0     931.51 GB   1953525168    9QJ2XK1R
p3     OK               u0     931.51 GB   1953525168    9QJ2010B
p4     OK               u0     1.36 TB     2930277168    6XW0L36T
p5     OK               u0     931.51 GB   1953525168    WD-WMATV2444836
p6     NOT-PRESENT      -      -           -             -
p7     NOT-PRESENT      -      -           -             -


Ctl  Date                        Severity  Alarm Message
------------------------------------------------------------------------------
c0   [Sat May 12 11:27:15 2012]  WARNING   Sector repair completed: port=0, LBA=0x6AE571C
c0   [Sat May 12 19:16:21 2012]  WARNING   Sector repair completed: port=1, LBA=0x40E62A23
c0   [Sat May 12 21:40:56 2012]  INFO      Verify completed: unit=0
c0   [Mon May 14 00:53:53 2012]  WARNING   Sector repair completed: port=1, LBA=0x7B8CFA7
c0   [Mon May 14 00:58:21 2012]  WARNING   Sector repair completed: port=1, LBA=0x7B8CFAA
c0   [Mon May 14 04:35:13 2012]  WARNING   Sector repair completed: port=0, LBA=0x8FEF2CF
c0   [Mon May 14 04:38:22 2012]  WARNING   Sector repair completed: port=0, LBA=0x8FEF2D1
c0   [Tue May 15 22:53:46 2012]  WARNING   Sector repair completed: port=0, LBA=0x13C2622
c0   [Wed May 16 00:39:31 2012]  WARNING   Sector repair completed: port=0, LBA=0x365A67F
c0   [Wed May 16 00:39:37 2012]  WARNING   Sector repair completed: port=0, LBA=0x365A685
c0   [Wed May 16 00:47:18 2012]  WARNING   Sector repair completed: port=0, LBA=0x365A687
c0   [Sat May 19 00:01:44 2012]  INFO      Verify started: unit=0
c0   [Sat May 19 04:46:20 2012]  WARNING   Sector repair completed: port=0, LBA=0x365A68E
c0   [Sat May 19 13:37:06 2012]  WARNING   Sector repair completed: port=1, LBA=0x7B8CFAC
c0   [Sat May 19 13:37:28 2012]  WARNING   Sector repair completed: port=1, LBA=0x7B8CFAE
c0   [Sat May 19 13:37:47 2012]  WARNING   Sector repair completed: port=1, LBA=0x7B8CFB1
c0   [Sat May 19 13:38:00 2012]  WARNING   Sector repair completed: port=1, LBA=0x7B8CFB3
c0   [Sat May 19 21:47:45 2012]  INFO      Verify completed: unit=0
c0   [Wed May 23 12:21:41 2012]  INFO      Cache synchronization completed: unit=0
c0   [Fri May 25 00:08:19 2012]  WARNING   Sector repair completed: port=0, LBA=0x12DA76C
c0   [Fri May 25 00:08:34 2012]  WARNING   Sector repair completed: port=0, LBA=0x12E4901
c0   [Fri May 25 00:09:33 2012]  WARNING   Sector repair completed: port=0, LBA=0x12DA773
c0   [Fri May 25 00:39:12 2012]  WARNING   Sector repair completed: port=0, LBA=0x42C597B
c0   [Sat May 26 00:01:45 2012]  INFO      Verify started: unit=0
c0   [Sat May 26 00:42:05 2012]  WARNING   Sector repair completed: port=1, LBA=0x323C1AC
c0   [Sat May 26 00:51:43 2012]  WARNING   Sector repair completed: port=1, LBA=0x323C1AE
c0   [Sat May 26 01:54:21 2012]  WARNING   Sector repair completed: port=1, LBA=0x2F0D302
c0   [Sat May 26 02:06:38 2012]  WARNING   Sector repair completed: port=0, LBA=0x12DA777
c0   [Sat May 26 02:07:21 2012]  WARNING   Sector repair completed: port=0, LBA=0x12E48FE
c0   [Sat May 26 04:20:00 2012]  WARNING   Sector repair completed: port=1, LBA=0x2F0D306
c0   [Sat May 26 04:32:58 2012]  WARNING   Sector repair completed: port=1, LBA=0x323C1B1
c0   [Sat May 26 04:33:21 2012]  WARNING   Sector repair completed: port=1, LBA=0x323C1B3
c0   [Sat May 26 04:33:44 2012]  WARNING   Sector repair completed: port=1, LBA=0x323C1BA
c0   [Sat May 26 05:24:07 2012]  WARNING   Sector repair completed: port=1, LBA=0x3F83862
c0   [Sat May 26 05:25:09 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 06:08:13 2012]  WARNING   Sector repair completed: port=0, LBA=0x4CDC6A2
c0   [Sat May 26 09:49:35 2012]  WARNING   Sector repair completed: port=1, LBA=0x6CACD4A
c0   [Sat May 26 18:10:44 2012]  WARNING   Sector repair completed: port=1, LBA=0x18F425EA
c0   [Sat May 26 19:45:40 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:22:52 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:23:15 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:23:22 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:23:35 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:23:41 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:23:49 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:23:57 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:24:02 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 20:54:41 2012]  WARNING   Verify fixed data/parity mismatch: unit=0
c0   [Sat May 26 22:00:30 2012]  INFO      Verify completed: unit=0
c0   [Sat Jun  2 00:01:43 2012]  INFO      Verify started: unit=0
c0   [Sat Jun  2 00:30:17 2012]  WARNING   Sector repair completed: port=0, LBA=0x2B911E4
c0   [Sat Jun  2 00:50:57 2012]  WARNING   Sector repair completed: port=0, LBA=0x5A807CA6
c0   [Sat Jun  2 04:13:13 2012]  WARNING   Sector repair completed: port=0, LBA=0x2D18291
c0   [Sat Jun  2 04:13:35 2012]  WARNING   Sector repair completed: port=0, LBA=0x2D1829F
c0   [Sat Jun  2 21:48:02 2012]  INFO      Verify completed: unit=0
c0   [Mon Jun  4 04:40:34 2012]  WARNING   Sector repair completed: port=1, LBA=0x4AF8098F
c0   [Tue Jun  5 00:28:19 2012]  WARNING   Sector repair completed: port=1, LBA=0x263C5CD
c0   [Tue Jun  5 00:33:06 2012]  WARNING   Sector repair completed: port=1, LBA=0x263C5CF
c0   [Thu Jun  7 00:34:27 2012]  WARNING   Sector repair completed: port=0, LBA=0x2A07B5F
c0   [Thu Jun  7 00:38:50 2012]  WARNING   Sector repair completed: port=0, LBA=0x2A07B61
c0   [Fri Jun  8 00:07:13 2012]  WARNING   Sector repair completed: port=0, LBA=0xC131F6B
c0   [Sat Jun  9 00:01:41 2012]  INFO      Verify started: unit=0
c0   [Sat Jun  9 00:29:11 2012]  WARNING   Sector repair completed: port=0, LBA=0x6C7614D
c0   [Sat Jun  9 00:38:25 2012]  WARNING   Sector repair completed: port=0, LBA=0x6C76152
c0   [Sat Jun  9 04:02:30 2012]  WARNING   Sector repair completed: port=1, LBA=0x263C5D1
c0   [Sat Jun  9 04:02:52 2012]  WARNING   Sector repair completed: port=1, LBA=0x263C5D3
c0   [Sat Jun  9 04:07:32 2012]  WARNING   Sector repair completed: port=0, LBA=0x27D3E12
c0   [Sat Jun  9 04:07:57 2012]  WARNING   Sector repair completed: port=0, LBA=0x27D3E15
c0   [Sat Jun  9 04:08:16 2012]  WARNING   Sector repair completed: port=0, LBA=0x27D3E17
c0   [Sat Jun  9 04:08:45 2012]  WARNING   Sector repair completed: port=0, LBA=0x27D3E19
c0   [Sat Jun  9 04:15:04 2012]  WARNING   Sector repair completed: port=0, LBA=0x2A07B64
c0   [Sat Jun  9 04:15:26 2012]  WARNING   Sector repair completed: port=0, LBA=0x2A07B66
c0   [Sat Jun  9 04:15:45 2012]  WARNING   Sector repair completed: port=0, LBA=0x2A07B68
c0   [Sat Jun  9 04:15:59 2012]  WARNING   Sector repair completed: port=0, LBA=0x2A07B6C
c0   [Sat Jun  9 04:16:13 2012]  WARNING   Sector repair completed: port=0, LBA=0x2A07B6E
c0   [Sat Jun  9 21:48:52 2012]  INFO      Verify completed: unit=0
c0   [Thu Jun 14 00:40:10 2012]  WARNING   Sector repair completed: port=0, LBA=0x334F14B
c0   [Sat Jun 16 00:01:38 2012]  INFO      Verify started: unit=0
c0   [Sat Jun 16 21:16:19 2012]  INFO      Verify completed: unit=0
c0   [Tue Jun 19 02:03:43 2012]  WARNING   Sector repair completed: port=1, LBA=0xFE41EAD
c0   [Wed Jun 20 02:30:02 2012]  WARNING   Sector repair completed: port=1, LBA=0xD99145C
c0   [Sat Jun 23 00:01:36 2012]  INFO      Verify started: unit=0
c0   [Sat Jun 23 04:27:04 2012]  WARNING   Sector repair completed: port=1, LBA=0x2FAD311
c0   [Sat Jun 23 06:52:38 2012]  WARNING   Sector repair completed: port=1, LBA=0x7C6AC8D
c0   [Sat Jun 23 06:53:03 2012]  WARNING   Sector repair completed: port=1, LBA=0x7C6AC91
c0   [Sat Jun 23 06:53:21 2012]  WARNING   Sector repair completed: port=1, LBA=0x7C6AC94
c0   [Sat Jun 23 17:00:22 2012]  WARNING   Sector repair completed: port=1, LBA=0xF9AC7C9
c0   [Sat Jun 23 21:15:19 2012]  INFO      Verify completed: unit=0
c0   [Sat Jun 30 00:01:34 2012]  INFO      Verify started: unit=0
c0   [Sat Jun 30 05:24:13 2012]  WARNING   Sector repair completed: port=0, LBA=0x3FAA9E7
c0   [Sat Jun 30 14:49:39 2012]  WARNING   Sector repair completed: port=1, LBA=0x869931C
c0   [Sat Jun 30 21:31:05 2012]  INFO      Verify completed: unit=0
c0   [Tue Jul  3 03:40:25 2012]  WARNING   Sector repair completed: port=1, LBA=0xD36C7F7
c0   [Fri Jul  6 02:50:18 2012]  WARNING   Sector repair completed: port=1, LBA=0x3562470
c0   [Fri Jul  6 22:18:26 2012]  WARNING   Sector repair completed: port=1, LBA=0x3563173
c0   [Sat Jul  7 00:01:31 2012]  INFO      Verify started: unit=0
c0   [Sat Jul  7 00:50:16 2012]  WARNING   Sector repair completed: port=0, LBA=0x76EE88
c0   [Sat Jul  7 00:50:39 2012]  WARNING   Sector repair completed: port=0, LBA=0x76EE8F
c0   [Sat Jul  7 21:39:36 2012]  INFO      Verify completed: unit=0
c0   [Sun Jul  8 02:51:05 2012]  WARNING   Sector repair completed: port=0, LBA=0x67759D
c0   [Sun Jul  8 02:53:55 2012]  WARNING   Sector repair completed: port=0, LBA=0x67759B
c0   [Tue Jul 10 16:17:21 2012]  WARNING   Sector repair completed: port=0, LBA=0x15C8C695
c0   [Wed Jul 11 22:51:22 2012]  WARNING   Sector repair completed: port=1, LBA=0x355BBD0
c0   [Sat Jul 14 00:01:28 2012]  INFO      Verify started: unit=0
c0   [Sat Jul 14 01:33:40 2012]  WARNING   Sector repair completed: port=1, LBA=0x1333BCF4
c0   [Sat Jul 14 03:36:23 2012]  WARNING   Sector repair completed: port=1, LBA=0x2174773
c0   [Sat Jul 14 11:26:44 2012]  WARNING   Sector repair completed: port=0, LBA=0x7429AB7
c0   [Sat Jul 14 16:53:50 2012]  WARNING   Sector repair completed: port=1, LBA=0xA17EB3F
c0   [Sat Jul 14 21:19:25 2012]  INFO      Verify completed: unit=0
c0   [Wed Jul 18 05:08:47 2012]  WARNING   Sector repair completed: port=1, LBA=0x17D62EDC
c0   [Wed Jul 18 05:14:15 2012]  WARNING   Sector repair completed: port=1, LBA=0x17D62EE1
c0   [Thu Jul 19 03:24:59 2012]  WARNING   Sector repair completed: port=0, LBA=0x7733C3D
c0   [Thu Jul 19 03:25:20 2012]  WARNING   Sector repair completed: port=0, LBA=0x773CEA5
c0   [Thu Jul 19 03:28:16 2012]  WARNING   Sector repair completed: port=0, LBA=0x7733C42
c0   [Thu Jul 19 03:28:41 2012]  WARNING   Sector repair completed: port=0, LBA=0x773CEAF
c0   [Sat Jul 21 00:01:26 2012]  INFO      Verify started: unit=0
c0   [Sat Jul 21 03:07:30 2012]  WARNING   Sector repair completed: port=1, LBA=0x1CC6936
c0   [Sat Jul 21 03:07:52 2012]  WARNING   Sector repair completed: port=1, LBA=0x1CC6938
c0   [Sat Jul 21 03:08:11 2012]  WARNING   Sector repair completed: port=1, LBA=0x1CC693A
c0   [Sat Jul 21 16:43:56 2012]  WARNING   Sector repair completed: port=0, LBA=0xD04C914
c0   [Sat Jul 21 16:45:31 2012]  WARNING   Sector repair completed: port=1, LBA=0xD456973
c0   [Sat Jul 21 21:14:29 2012]  INFO      Verify completed: unit=0
c0   [Wed Jul 25 03:37:25 2012]  WARNING   Sector repair completed: port=0, LBA=0x1F8E6C43
c0   [Sat Jul 28 00:01:24 2012]  INFO      Verify started: unit=0
c0   [Sat Jul 28 01:45:27 2012]  WARNING   Sector repair completed: port=0, LBA=0x11584AD
c0   [Sat Jul 28 18:54:25 2012]  WARNING   Sector repair completed: port=1, LBA=0x447C3E6C
c0   [Sat Jul 28 21:13:46 2012]  INFO      Verify completed: unit=0
c0   [Wed Aug  1 03:20:11 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FEF
c0   [Fri Aug  3 00:50:03 2012]  WARNING   Sector repair completed: port=0, LBA=0xCED0ACA
c0   [Sat Aug  4 00:01:22 2012]  INFO      Verify started: unit=0
c0   [Sat Aug  4 00:52:51 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FF3
c0   [Sat Aug  4 00:53:14 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FF5
c0   [Sat Aug  4 00:53:33 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FF7
c0   [Sat Aug  4 00:53:47 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FF9
c0   [Sat Aug  4 00:54:00 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FFB
c0   [Sat Aug  4 00:54:14 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FFD
c0   [Sat Aug  4 00:54:27 2012]  WARNING   Sector repair completed: port=0, LBA=0x805FFF
c0   [Sat Aug  4 04:43:12 2012]  WARNING   Sector repair completed: port=1, LBA=0x16974289
c0   [Sat Aug  4 04:58:17 2012]  WARNING   Sector repair completed: port=1, LBA=0x1697428E
c0   [Sat Aug  4 20:54:53 2012]  INFO      Verify completed: unit=0
c0   [Wed Aug  8 03:21:55 2012]  ERROR     Drive timeout detected: port=1
c0   [Wed Aug  8 15:31:44 2012]  WARNING   Sector repair completed: port=0, LBA=0x1A366CD3
c0   [Sat Aug 11 00:01:21 2012]  INFO      Verify started: unit=0
c0   [Sat Aug 11 20:40:51 2012]  INFO      Verify completed: unit=0
c0   [Thu Aug 16 05:10:55 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C22593
c0   [Sat Aug 18 00:01:18 2012]  INFO      Verify started: unit=0
c0   [Sat Aug 18 03:00:20 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C225A5
c0   [Sat Aug 18 03:43:00 2012]  WARNING   Sector repair completed: port=1, LBA=0x23EE91E
c0   [Sat Aug 18 03:43:23 2012]  WARNING   Sector repair completed: port=1, LBA=0x23EE920
c0   [Sat Aug 18 17:00:06 2012]  WARNING   Sector repair completed: port=1, LBA=0x137D066A
c0   [Sat Aug 18 17:00:29 2012]  WARNING   Sector repair completed: port=1, LBA=0x137D066D
c0   [Sat Aug 18 21:13:01 2012]  INFO      Verify completed: unit=0
c0   [Wed Aug 22 01:36:08 2012]  WARNING   Sector repair completed: port=0, LBA=0x2560A0F
c0   [Wed Aug 22 01:37:42 2012]  WARNING   Sector repair completed: port=0, LBA=0x2560A13
c0   [Fri Aug 24 04:01:36 2012]  WARNING   Sector repair completed: port=1, LBA=0x55C1A5DF
c0   [Fri Aug 24 05:02:06 2012]  WARNING   Sector repair completed: port=1, LBA=0xCE3378A
c0   [Sat Aug 25 00:01:17 2012]  INFO      Verify started: unit=0
c0   [Sat Aug 25 00:31:06 2012]  WARNING   Sector repair completed: port=1, LBA=0x50F65D
c0   [Sat Aug 25 00:39:52 2012]  WARNING   Sector repair completed: port=0, LBA=0x678FF4
c0   [Sat Aug 25 03:43:15 2012]  WARNING   Sector repair completed: port=0, LBA=0x2560A15
c0   [Sat Aug 25 03:43:39 2012]  WARNING   Sector repair completed: port=0, LBA=0x2560A19
c0   [Sat Aug 25 03:43:58 2012]  WARNING   Sector repair completed: port=0, LBA=0x2560A1B
c0   [Sat Aug 25 03:44:30 2012]  WARNING   Sector repair completed: port=0, LBA=0x2560A21
c0   [Sat Aug 25 20:58:14 2012]  INFO      Verify completed: unit=0
c0   [Wed Aug 29 04:57:15 2012]  WARNING   Sector repair completed: port=1, LBA=0xF3957EB
c0   [Sat Sep  1 00:01:15 2012]  INFO      Verify started: unit=0
c0   [Sat Sep  1 03:21:52 2012]  WARNING   Sector repair completed: port=0, LBA=0x1DAFC86
c0   [Sat Sep  1 03:22:15 2012]  WARNING   Sector repair completed: port=0, LBA=0x1DAFC88
c0   [Sat Sep  1 03:22:34 2012]  WARNING   Sector repair completed: port=0, LBA=0x1DAFC8A
c0   [Sat Sep  1 03:22:47 2012]  WARNING   Sector repair completed: port=0, LBA=0x1DAFC8C
c0   [Sat Sep  1 17:17:22 2012]  WARNING   Sector repair completed: port=0, LBA=0xF917FD1
c0   [Sat Sep  1 17:17:45 2012]  WARNING   Sector repair completed: port=0, LBA=0xF917FD3
c0   [Sat Sep  1 17:18:04 2012]  WARNING   Sector repair completed: port=0, LBA=0xF917FD5
c0   [Sat Sep  1 21:36:56 2012]  INFO      Verify completed: unit=0
c0   [Thu Sep  6 00:07:30 2012]  WARNING   Sector repair completed: port=0, LBA=0xDA3C64B
c0   [Thu Sep  6 00:32:56 2012]  WARNING   Sector repair completed: port=1, LBA=0x6BBA816
c0   [Sat Sep  8 00:01:13 2012]  INFO      Verify started: unit=0
c0   [Sat Sep  8 00:09:56 2012]  WARNING   Sector repair completed: port=0, LBA=0xDEBC958
c0   [Sat Sep  8 04:38:45 2012]  WARNING   Sector repair completed: port=0, LBA=0x38D254F
c0   [Sat Sep  8 20:44:50 2012]  INFO      Verify completed: unit=0
c0   [Mon Sep 10 01:26:34 2012]  WARNING   Sector repair completed: port=1, LBA=0xFFD8D5E
c0   [Wed Sep 12 00:33:48 2012]  WARNING   Sector repair completed: port=1, LBA=0xE8DB928
c0   [Wed Sep 12 00:36:33 2012]  WARNING   Sector repair completed: port=1, LBA=0x6D49411
c0   [Fri Sep 14 01:59:39 2012]  WARNING   Sector repair completed: port=0, LBA=0x1467F1C
c0   [Fri Sep 14 02:08:27 2012]  WARNING   Sector repair completed: port=1, LBA=0x14C8ABD
c0   [Fri Sep 14 03:54:47 2012]  WARNING   Sector repair completed: port=0, LBA=0x1580C915
c0   [Sat Sep 15 00:01:11 2012]  INFO      Verify started: unit=0
c0   [Sat Sep 15 02:38:14 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C178973
c0   [Sat Sep 15 02:59:02 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C178975
c0   [Sat Sep 15 04:47:08 2012]  WARNING   Sector repair completed: port=0, LBA=0x3FA0356
c0   [Sat Sep 15 04:47:31 2012]  WARNING   Sector repair completed: port=0, LBA=0x3FA0359
c0   [Sat Sep 15 10:41:59 2012]  WARNING   Sector repair completed: port=0, LBA=0x6DFD1EC
c0   [Sat Sep 15 13:25:23 2012]  WARNING   Sector repair completed: port=0, LBA=0x7CBD100
c0   [Sat Sep 15 13:25:31 2012]  WARNING   Sector repair completed: port=0, LBA=0x7CBD104
c0   [Sat Sep 15 13:25:54 2012]  WARNING   Sector repair completed: port=0, LBA=0x7CBD106
c0   [Sat Sep 15 17:10:50 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C178977
c0   [Sat Sep 15 20:59:57 2012]  INFO      Verify completed: unit=0
c0   [Tue Sep 18 01:17:18 2012]  WARNING   Sector repair completed: port=1, LBA=0x803B05B
c0   [Sat Sep 22 00:01:10 2012]  INFO      Verify started: unit=0
c0   [Sat Sep 22 20:54:31 2012]  INFO      Verify completed: unit=0
c0   [Tue Sep 25 01:56:47 2012]  WARNING   Sector repair completed: port=0, LBA=0x26E3909
c0   [Sat Sep 29 00:01:08 2012]  INFO      Verify started: unit=0
c0   [Sat Sep 29 02:04:14 2012]  WARNING   Sector repair completed: port=0, LBA=0x146AC03
c0   [Sat Sep 29 10:58:39 2012]  WARNING   Sector repair completed: port=0, LBA=0x6D4EB0E
c0   [Sat Sep 29 10:59:02 2012]  WARNING   Sector repair completed: port=0, LBA=0x6D4EB14
c0   [Sat Sep 29 11:22:44 2012]  WARNING   Sector repair completed: port=0, LBA=0x6F79623
c0   [Sat Sep 29 13:50:48 2012]  WARNING   Sector repair completed: port=1, LBA=0x7D1D65E
c0   [Sat Sep 29 13:51:11 2012]  WARNING   Sector repair completed: port=1, LBA=0x7D1D661
c0   [Sat Sep 29 13:51:30 2012]  WARNING   Sector repair completed: port=1, LBA=0x7D1D663
c0   [Sat Sep 29 20:57:34 2012]  INFO      Verify completed: unit=0
c0   [Mon Oct  1 04:47:24 2012]  WARNING   Sector repair completed: port=0, LBA=0xC5BC6F2
c0   [Tue Oct  2 02:00:27 2012]  WARNING   Sector repair completed: port=0, LBA=0x1547667
c0   [Tue Oct  2 02:01:56 2012]  WARNING   Sector repair completed: port=0, LBA=0x154766F
c0   [Tue Oct  2 05:02:31 2012]  WARNING   Sector repair completed: port=1, LBA=0xD67D054
c0   [Tue Oct  2 05:04:14 2012]  WARNING   Sector repair completed: port=1, LBA=0xD67D056
c0   [Wed Oct  3 01:22:12 2012]  WARNING   Sector repair completed: port=1, LBA=0x12AAF8CA
c0   [Thu Oct  4 04:29:22 2012]  WARNING   Sector repair completed: port=0, LBA=0x13E6F992
c0   [Thu Oct  4 05:10:51 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C252A4
c0   [Sat Oct  6 00:01:07 2012]  INFO      Verify started: unit=0
c0   [Sat Oct  6 19:41:18 2012]  WARNING   Sector repair completed: port=1, LBA=0x5A5C3AE8
c0   [Sat Oct  6 21:01:05 2012]  INFO      Verify completed: unit=0
c0   [Mon Oct  8 00:32:06 2012]  WARNING   Sector repair completed: port=0, LBA=0x6C60D3E
c0   [Tue Oct  9 03:51:03 2012]  WARNING   Sector repair completed: port=1, LBA=0x89B5EC9
c0   [Thu Oct 11 04:21:17 2012]  WARNING   Sector repair completed: port=1, LBA=0x13F85833
c0   [Sat Oct 13 00:01:05 2012]  INFO      Verify started: unit=0
c0   [Sat Oct 13 05:12:40 2012]  WARNING   Sector repair completed: port=0, LBA=0x3FA5134
c0   [Sat Oct 13 21:08:35 2012]  INFO      Verify completed: unit=0
c0   [Tue Oct 16 03:53:50 2012]  WARNING   Sector repair completed: port=1, LBA=0x148AA1BD
c0   [Thu Oct 18 03:20:30 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C8DABCB
c0   [Thu Oct 18 04:52:50 2012]  WARNING   Sector repair completed: port=0, LBA=0xE879057
c0   [Sat Oct 20 00:01:04 2012]  INFO      Verify started: unit=0
c0   [Sat Oct 20 02:19:25 2012]  WARNING   Sector repair completed: port=1, LBA=0x174B012
c0   [Sat Oct 20 03:41:38 2012]  WARNING   Sector repair completed: port=0, LBA=0x256D93B
c0   [Sat Oct 20 03:42:01 2012]  WARNING   Sector repair completed: port=0, LBA=0x256D93D
c0   [Sat Oct 20 03:42:40 2012]  WARNING   Sector repair completed: port=0, LBA=0x256D940
c0   [Sat Oct 20 03:42:59 2012]  WARNING   Sector repair completed: port=0, LBA=0x256D942
c0   [Sat Oct 20 03:43:12 2012]  WARNING   Sector repair completed: port=0, LBA=0x256D944
c0   [Sat Oct 20 03:43:26 2012]  WARNING   Sector repair completed: port=0, LBA=0x256D948
c0   [Sat Oct 20 16:37:52 2012]  WARNING   Sector repair completed: port=0, LBA=0xE879060
c0   [Sat Oct 20 16:38:15 2012]  WARNING   Sector repair completed: port=0, LBA=0xE879062
c0   [Sat Oct 20 21:00:18 2012]  INFO      Verify completed: unit=0
c0   [Sat Oct 20 23:49:01 2012]  WARNING   Sector repair completed: port=1, LBA=0x4473E908
c0   [Sun Oct 21 03:42:26 2012]  WARNING   Sector repair completed: port=1, LBA=0x175BADD5
c0   [Tue Oct 23 01:09:04 2012]  WARNING   Sector repair completed: port=1, LBA=0x6E524860
c0   [Fri Oct 26 03:21:25 2012]  WARNING   Sector repair completed: port=0, LBA=0x802C61
c0   [Fri Oct 26 04:22:21 2012]  WARNING   Sector repair completed: port=0, LBA=0x176353CD
c0   [Sat Oct 27 00:01:03 2012]  INFO      Verify started: unit=0
c0   [Sat Oct 27 00:49:35 2012]  WARNING   Sector repair completed: port=0, LBA=0x802C65
c0   [Sat Oct 27 17:02:24 2012]  WARNING   Sector repair completed: port=1, LBA=0xC1FF26D
c0   [Sat Oct 27 17:09:06 2012]  WARNING   Sector repair completed: port=0, LBA=0xDF621AD
c0   [Sat Oct 27 21:30:57 2012]  INFO      Verify completed: unit=0
c0   [Tue Oct 30 00:20:46 2012]  WARNING   Sector repair completed: port=0, LBA=0xE9FE2AB
c0   [Wed Oct 31 02:02:03 2012]  WARNING   Sector repair completed: port=0, LBA=0x1460C25
c0   [Wed Oct 31 02:04:05 2012]  WARNING   Sector repair completed: port=0, LBA=0x1460C28
c0   [Thu Nov  1 00:48:34 2012]  WARNING   Sector repair completed: port=1, LBA=0xA7C92BE
c0   [Thu Nov  1 05:04:45 2012]  WARNING   Sector repair completed: port=0, LBA=0x1C252C2

[root@backup1 /data/deprecated]#

Look for failed drives and errors. Obviously from the above we need to probably replace drives 0 and 1 and drive 1 is even showing as having problems, yet the RAID array is healthy, amazingly. You also see the automatic verifies.

Note: when rebuilding a degraded mirror, you will see no progress as it rebuilds in the cli

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    REBUILDING     0       -       64K     4656.56   OFF    ON

areca[edit]

We are using an areca controller on backup3.

[root@newbackup3 ~]# sh /root/verify.sh
  # Name             Raid Name       Level   Capacity Ch/Id/Lun  State
===============================================================================
  1 ARC-1160-VOL#00  Raid Set # 00   Raid5   5000.0GB 00/00/00   Checking(19.7%)
===============================================================================
GuiErrMsg<0x00>: Success.
 #  Name             Disks TotalCap  FreeCap DiskChannels       State
===============================================================================
 1  Raid Set # 00        6 6000.0GB    0.0GB 123456             Checking
===============================================================================
GuiErrMsg<0x00>: Success.
Date-Time            Device           Event Type            Elapsed Time Errors
===============================================================================
2012-12-05 20:40:58  ARC-1160-VOL#00  Start Checking
2012-12-01 05:06:04  ARC-1160-VOL#00  Complete Init         027:30:45
2012-11-30 01:35:19  ARC-1160-VOL#00  Start Initialize
2026-08-06 01:34:52  H/W MONITOR      Raid Powered On
2012-11-30 01:33:36  ARC-1160-VOL#00  Stop Initialization   000:31:48
2012-11-30 01:01:47  ARC-1160-VOL#00  Start Initialize
2026-08-06 00:58:13  H/W MONITOR      Raid Powered On
2012-11-30 00:57:26  ARC-1160-VOL#00  Stop Initialization   000:57:07
2012-11-30 00:00:19  ARC-1160-VOL#00  Start Initialize
2026-08-05 23:56:48  H/W MONITOR      Raid Powered On
2026-08-05 23:52:58  H/W MONITOR      Raid Powered On
2026-08-05 23:50:14  H/W MONITOR      Raid Powered On
2026-08-05 23:43:30  H/W MONITOR      Raid Powered On
2012-11-29 23:10:07  ARC-1160-VOL#00  Stop Initialization   000:00:56
2012-11-29 23:09:11  ARC-1160-VOL#00  Start Initialize
2026-08-05 23:08:57  H/W MONITOR      Raid Powered On
2012-11-29 23:08:10  ARC-1160-VOL#00  Stop Initialization   000:20:41
2012-11-29 22:47:29  ARC-1160-VOL#00  Start Initialize
2026-08-05 22:46:59  H/W MONITOR      Raid Powered On
2026-08-05 22:45:55  H/W MONITOR      Raid Powered On
2026-08-05 22:44:53  H/W MONITOR      Raid Powered On
2026-08-05 22:42:06  H/W MONITOR      Raid Powered On
2026-08-05 22:40:50  H/W MONITOR      Raid Powered On
2012-11-29 22:40:04  ARC-1160-VOL#00  Stop Initialization   000:24:25
2012-11-29 22:15:38  ARC-1160-VOL#00  Start Initialize
2026-08-05 22:15:11  000:000001215B00 Restart Init LBA Point
2026-08-05 22:15:10  H/W MONITOR      Raid Powered On
2012-11-29 21:56:38  ARC-1160-VOL#00  Start Initialize
2026-08-05 21:56:12  H/W MONITOR      Raid Powered On
2026-08-05 21:56:04  IDE Channel #03  Device Inserted
2012-11-29 21:55:13  IDE Channel #04  Device Inserted
2012-11-29 21:55:03  IDE Channel #02  Device Inserted
2026-08-05 21:53:09  H/W MONITOR      Raid Powered On
2026-08-05 20:51:46  H/W MONITOR      Raid Powered On
2026-08-05 20:49:56  H/W MONITOR      Raid Powered On
2026-08-05 20:48:29  H/W MONITOR      Raid Powered On
2026-08-05 20:46:29  H/W MONITOR      Raid Powered On
2026-08-05 20:44:49  H/W MONITOR      Raid Powered On
2026-08-05 20:43:01  H/W MONITOR      Raid Powered On
2026-08-05 20:36:25  H/W MONITOR      Raid Powered On
2026-08-05 20:31:18  H/W MONITOR      Raid Powered On
2026-08-05 20:30:08  H/W MONITOR      Raid Powered On
2026-08-05 20:08:40  H/W MONITOR      Raid Powered On
2026-08-05 20:06:11  H/W MONITOR      Raid Powered On
2026-08-05 20:05:14  H/W MONITOR      Raid Powered On
2026-08-05 20:03:58  H/W MONITOR      Raid Powered On
2026-08-05 20:00:56  H/W MONITOR      Raid Powered On
2026-08-05 19:57:57  H/W MONITOR      Raid Powered On
2026-08-05 19:56:15  H/W MONITOR      Raid Powered On
2026-08-05 19:55:05  H/W MONITOR      Raid Powered On
2026-08-05 17:24:36  H/W MONITOR      Raid Powered On
2026-08-05 17:22:43  H/W MONITOR      Raid Powered On
2026-08-05 04:50:42  H/W MONITOR      Raid Powered On
2026-08-05 04:47:33  H/W MONITOR      Raid Powered On
2026-08-05 04:43:57  H/W MONITOR      Raid Powered On
2026-08-05 04:18:52  H/W MONITOR      Raid Powered On
2026-08-05 04:17:30  H/W MONITOR      Raid Powered On
2026-08-05 04:13:30  H/W MONITOR      Raid Powered On
2026-08-05 04:10:26  H/W MONITOR      Raid Powered On
2026-08-05 04:09:23  H/W MONITOR      Raid Powered On
2026-08-05 00:08:09  H/W MONITOR      Raid Powered On
2026-08-05 00:07:12  H/W MONITOR      Raid Powered On
2026-08-05 00:05:51  H/W MONITOR      Raid Powered On
2026-08-05 00:04:27  H/W MONITOR      Raid Powered On
===============================================================================
GuiErrMsg<0x00>: Success.
press enter when ready to run verify

Look for failed drives and errors.

When it proceed's to verifying, you can confirm with:

[root@newbackup3 ~]# cli64 vsf info
  # Name             Raid Name       Level   Capacity Ch/Id/Lun  State
===============================================================================
  1 ARC-1160-VOL#00  Raid Set # 00   Raid5   5000.0GB 00/00/00   Checking(22.5%)
===============================================================================
GuiErrMsg<0x00>: Success.
[root@newbackup3 ~]#

Update OS list[edit]

check for any new VZ templates we want to offer: vzup2date -z
see if there's any OS's we want to include in our colo install list. Update 2 places: signup/html/colo_quote.html & signup/html/step1.html
update the mgmt database (ref_templates table, ref_systems table).

Infrequent tasks[edit]

Free up space on gateway[edit]

newgateway /var/spool# cd clientmqueue/
newgateway /var/spool/clientmqueue# sh
# for f in `ls`; do rm $f; done
exit

Free up space on mail[edit]

You can clear out root mail:

mail /var/log# ll -h /var/mail/root
-rw-------  1 root  mail    543K Dec 19 13:05 /var/mail/root
mail /var/log# rm /var/mail/root

Or you can archive mail logs:

mail /var/log# ls -l htt*
-rw-r--r--  1 root  wheel  297436931 Dec 19 13:26 httpd-access.log
-rw-r--r--  1 root  wheel    9824324 Jul  4 11:34 httpd-access.log.old.0.gz
-rw-r--r--  1 root  wheel    6884137 Mar 17  2012 httpd-access.log.old.1.gz
-rw-r--r--  1 root  wheel   18557444 Dec  3  2009 httpd-access.log.old.10.gz
-rw-r--r--  1 root  wheel   14740263 Jan  9  2007 httpd-access.log.old.11.gz
-rw-r--r--  1 root  wheel   14209465 Nov 28  2007 httpd-access.log.old.12.gz
-rw-r--r--  1 root  wheel   16874396 Feb 19  2012 httpd-access.log.old.3.gz
-rw-r--r--  1 root  wheel   14554859 Jul 22  2011 httpd-access.log.old.4.gz
-rw-r--r--  1 root  wheel   10513227 Feb 18  2011 httpd-access.log.old.5.gz
-rw-r--r--  1 root  wheel    7201946 Oct 29  2010 httpd-access.log.old.6.gz
-rw-r--r--  1 root  wheel   10062537 May  6  2010 httpd-access.log.old.7.gz
-rw-r--r--  1 root  wheel   10157042 Aug 12  2010 httpd-access.log.old.8.gz
-rw-r--r--  1 root  wheel   11909534 Mar  4  2010 httpd-access.log.old.9.gz
-rw-r--r--  1 root  wheel   59030930 Dec 19 13:01 httpd-error.log
-rw-r--r--  1 root  wheel    3413134 Mar  4  2010 httpd-error.log.0.gz
-rw-r--r--  1 root  wheel     795515 May  1  2007 httpd-error.log.1.gz
-rw-r--r--  1 root  wheel    1142153 Nov 30  2007 httpd-error.log.2.gz
-rw-r--r--  1 root  wheel    2325801 Feb 18  2011 httpd-error.log.gz

mail /var/log# sh
# for f in 12 11 10 9 8 7 6 5 4 3 2 1 0; do g=`echo $f+1|bc`; mv httpd-access.log.old.$f.gz httpd-access.log.old.$g.gz; done
# mv httpd-access.log httpd-access.log.old.0
# touch httpd-access.log
# apachectl restart
# gzip httpd-access.log.old.0

# for f in 2 1 0; do g=`echo $f+1|bc`; mv httpd-error.log.$f.gz httpd-error.log.$g.gz; done
# mv httpd-error.log httpd-error.log.0
# touch httpd-error.log
# apachectl restart
# gzip httpd-error.log.0
# exit

Free up space on bwdb2[edit]

You can either remove items from /usr/home/archive or you can scp them to backup3:/data/bwdb2/archive .

Free up space on backup1[edit]

backup1 is our primary customer backup system. As usage grows over time, it needs to be regularly purged of old files. The easiest way to do this is by removing deprecated files. These mostly consist of cancelled customers or temporary dump/storage files (created during dump/restores). Our standard policy is to hang onto cancelled customers for 6mos after which we remove their files (as far as customers know their data is purged immediately, but we hang onto it just in case.. and in some cases we cancel a server due to non payment so this makes it easy to restore their system). To find files to remove:

[root@backup1 ~]# cd /data/deprecated/
[root@backup1 /data/deprecated]# ls
2101-migrated-20120317.tgz                old-683-cxld-20121021.tgz
69.55.230.2-wwwbackup                     old-744-cxld-20120708.tgz
991-DONTDELETE.tgz                        old-809-cxld-20120609.tgz
archive-col02050-mdfile-cxld-20120409.gz  old-854-cxld-20120621.tgz
col01371.tgz                              old-931-cxld-20060513.tgz
deleteme_ubuntu-10.10-x86_20111205        old-col00123-mdfile-noarchive-20120417.gz
jail10_old                                old-col00147-vnfile-cxld-20120828.gz
jail14_rsync_old                          old-col00419-dump-cxld-20120224.gz
jail15_old                                old-col01098-vnfile-cxld-20120827.gz
jail3_old                                 old-col01278-dump-cxld-20120822
jail4_old                                 old-col01517-dump-cxld-20120828
jail5_old                                 old-col01669-dump-cxld-20120203.gz
old-1009-cxld-20120608.tgz                old-col01687-dump-cxld-20120909
old-1012-cxld-20120411.tgz                old-col01790-dump-cxld-20120828
old-1052-cxld-20120721.tgz                old-col01812-dump-cxld-20120820
old-10631-cxld-20120622.tgz               old-col01938-mdfile-cxld-20120619.gz
old-10632-cxld-20120622.tgz               old-col02095-mdfile-noarchive-20120523.gz
old-10633-cxld-20120622.tgz               olddebian-3.0-v15-20110610.tgz
old-1236-cxld-20120621.tgz                oldmod_frontpage-deb30-v15-20110610.tgz
old-1381-cxld-20120404.tgz                oldmod_perl-deb30-v15-20110610.tgz
old-1422-cxld-20120721.tgz                oldmod_ssl-deb30-v15-20110610.tgz
old-14681-cxld-20120619.tgz               oldmysql-deb30-v15-20110610.tgz
old-1544-cxld-20120626.tgz                oldproftpd-deb30-v15-20110610.tgz
old-18351-cxld-20120605.tgz               old_virt14
old-1853-cxld-20120910.tgz                old_virt18
old-1963-cxld-20120206.tgz                oldwebmin-deb30-v15-20110610.tgz
old-1967-cxld-20120605.tgz                suse.virt11.20120421.tgz
old-1981-noarchive-20120729.tgz           virt11
old-2030-migrated-noarchive-20120727.tgz  virt12_old
old-2037-cxld-20120716.tgz                virt13_old
old-2065-cxld-20120727.tgz                virt16_old
old-2068-cxld-20120424.tgz                virt4_old
old-2085-cxld-20120531.tgz                virt5_old
old-364-cxld-20120904.tgz                 virt6_old
old-446-cxld-20120512.tgz                 virt7_old
old-613-cxld-20120601.tgz                 virt8_old
[root@backup1 /data/deprecated]#

virtX_old and jailX_old are permanently archived, so ignore those as well as anything else marked not to delete or otherwise suspicious. Likewise, probably a good idea to try to hang onto oldTEMPLATE.gz as long as we can as well. Most of the stuff we want to delete is dated when it was deprecated, making this easy. So to remove files from 6 mos ago (running this in Oct):

[root@backup1 /data/deprecated]# ls old*201204*
old-1012-cxld-20120411.tgz  old-2068-cxld-20120424.tgz
old-1381-cxld-20120404.tgz  old-col00123-mdfile-noarchive-20120417.gz
[root@backup1 /data/deprecated]# rm old*201204*

Every few months you will also want to remove some of the snapshot archives for mail. We typically save the 1st, 10th, and 20th of each month. To do this you set aside the dates you want to save then remove months at a time, followed by restoring the set aside dates. Here's how that works:

[root@backup1 /data/www/daily]# ls
05                     08-10-11  10-04-10  11-10-10  12-07-29  12-09-21  12-11-14
06                     08-10-21  10-04-20  11-10-20  12-07-30  12-09-22  12-11-15
06-06-01-usr-home.tgz  08-11-01  10-05-01  11-11-01  12-07-31  12-09-23  12-11-16
06-07-01-usr-home.tgz  08-11-10  10-05-11  11-11-10  12-08-01  12-09-24  12-11-17
06-08-01-usr-home.tgz  08-11-20  10-05-20  11-11-20  12-08-02  12-09-25  12-11-18
06-09-01-usr-home.tgz  08-12-01  10-06-01  11-12-01  12-08-03  12-09-26  12-11-19
06-11-10               08-12-10  10-06-10  11-12-10  12-08-04  12-09-27  12-11-20
06-12-21               08-12-20  10-06-20  11-12-20  12-08-05  12-09-28  12-11-21
07-01-10               09-01-01  10-07-01  12-01-01  12-08-06  12-09-29  12-11-22
07-01-20               09-01-10  10-07-10  12-01-10  12-08-07  12-09-30  12-11-23
07-02-10               09-01-20  10-07-20  12-01-20  12-08-08  12-10-01  12-11-24
07-02-20               09-02-01  10-08-01  12-02-01  12-08-09  12-10-02  12-11-25
07-03-01               09-02-10  10-08-10  12-02-10  12-08-10  12-10-03  12-11-26
07-03-20               09-02-20  10-08-20  12-02-20  12-08-11  12-10-04  12-11-27
07-04-01               09-03-01  10-09-01  12-03-01  12-08-12  12-10-05  12-11-28
07-04-10               09-03-10  10-09-10  12-03-10  12-08-13  12-10-06  12-11-29
07-04-20               09-03-20  10-09-20  12-03-20  12-08-14  12-10-07  12-11-30
07-05-01               09-04-01  10-10-01  12-04-01  12-08-15  12-10-08  12-12-01
07-05-10               09-04-10  10-10-10  12-04-10  12-08-16  12-10-09  12-12-02
07-05-20               09-04-20  10-10-20  12-04-20  12-08-17  12-10-10  12-12-03
07-06-01               09-05-01  10-11-01  12-05-01  12-08-18  12-10-11  12-12-04
07-06-10               09-05-10  10-11-10  12-05-10  12-08-19  12-10-12  12-12-05
07-06-20               09-05-20  10-11-20  12-05-20  12-08-20  12-10-13  12-12-06
07-07-20               09-06-01  10-12-01  12-06-01  12-08-21  12-10-14  12-12-07
07-08-10               09-06-10  10-12-10  12-06-10  12-08-22  12-10-15  12-12-08
07-08-20               09-06-20  10-12-20  12-06-20  12-08-23  12-10-16  12-12-09
07-09-01               09-07-01  11-01-01  12-07-01  12-08-24  12-10-17  12-12-10
07-10-01               09-07-10  11-01-10  12-07-02  12-08-25  12-10-18  12-12-11
07-10-10               09-07-20  11-01-21  12-07-03  12-08-26  12-10-19  12-12-12
07-10-20               09-08-01  11-02-01  12-07-04  12-08-27  12-10-20  12-12-13
07-12-01               09-08-10  11-02-10  12-07-05  12-08-28  12-10-21  12-12-14
07-12-10               09-08-20  11-02-20  12-07-06  12-08-29  12-10-22  12-12-15
08-01-01               09-09-01  11-03-01  12-07-07  12-08-30  12-10-23  12-12-16
08-01-20               09-09-10  11-03-10  12-07-08  12-08-31  12-10-24  12-12-17
08-02-20               09-09-20  11-03-20  12-07-09  12-09-01  12-10-25  12-12-18
08-03-01               09-10-01  11-04-01  12-07-10  12-09-02  12-10-26  12-12-19
08-03-10               09-10-10  11-04-10  12-07-11  12-09-03  12-10-27  12-12-20
08-03-20               09-10-20  11-04-20  12-07-12  12-09-04  12-10-28  12-12-21
08-04-01               09-11-01  11-05-01  12-07-13  12-09-05  12-10-29  12-12-22
08-04-20               09-11-10  11-05-10  12-07-14  12-09-06  12-10-30  12-12-23
08-05-01               09-11-20  11-05-20  12-07-15  12-09-07  12-10-31  12-12-24
08-05-10               09-12-01  11-06-01  12-07-16  12-09-08  12-11-01  12-12-25
08-06-10               09-12-10  11-06-10  12-07-17  12-09-09  12-11-02  12-12-26
08-06-20               09-12-20  11-06-20  12-07-18  12-09-10  12-11-03  12-12-27
08-07-02               10-01-01  11-07-01  12-07-19  12-09-11  12-11-04  12-12-28
08-07-10               10-01-10  11-07-10  12-07-20  12-09-12  12-11-05  2008-10-23
08-07-20               10-01-20  11-07-20  12-07-21  12-09-13  12-11-06  bb.tgz
08-08-01               10-02-01  11-08-01  12-07-22  12-09-14  12-11-07  boot
08-08-10               10-02-10  11-08-10  12-07-23  12-09-15  12-11-08  current
08-08-21               10-02-20  11-08-20  12-07-24  12-09-16  12-11-09  hold
08-09-01               10-03-01  11-09-01  12-07-25  12-09-17  12-11-10
08-09-10               10-03-10  11-09-10  12-07-26  12-09-18  12-11-11
08-09-21               10-03-20  11-09-20  12-07-27  12-09-19  12-11-12
08-10-01               10-04-01  11-10-01  12-07-28  12-09-20  12-11-13
[root@backup1 /data/www/daily]#

So we see that everything up to July 2012 has been pruned. To prune July 2012 we do the following:

mv 12-07-01 hold
mv 12-07-10 hold
mv 12-07-20 hold
rm -fr 12-07*
mv hold/* .

Free up space on Other Servers[edit]

Many servers start to run out of disk space over time. Often it is caused by unread mail for root or log files.

To find the source of the problems, you use "du" to find where the disk space is being used. You can't do a du on /proc or /dev, so I use the command

[root@virt11 /]# du -hs [a-c]* deprecated [e-o]* [q-u]* var | tee duhs0

which produces something like this.

4.0K    backup
4.0K    backup1
4.0K    backup2
4.0K    backup3
4.0K    backup4
7.5M    bin
47M     boot
4.0K    deprecated
92M     etc
30M     home
8.0K    initrd
541M    lib
16K     lost+found
8.0K    media
0       misc
8.0K    mnt
0       net
92M     opt
336K    root
36M     sbin
8.0K    selinux
8.0K    srv
0       sys
4.0K    test
16K     tmp
1.2G    usr
583M    var

In this case it looks like /var is the problem, so

cd /var
du -hs * | tee duhs9

Produces

12K     account
2.6M    analog-5.32
63M     cache
24K     db
4.0K    duhs
4.0K    duhs1
4.0K    duhs2
4.0K    duhs3
4.0K    duhs4
4.0K    duhs5
4.0K    duhs6
4.0K    duhs7
4.0K    duhs8
32K     empty
8.0K    games
16K     kerberos
42M     lib
8.0K    local
36K     lock
457M    log
0       mail
8.0K    nis
8.0K    opt
8.0K    preserve
8.0K    racoon
240K    run
18M     spool
8.0K    tmp
64K     vz
0       vzagent
0       vzagent.tmp
16K     vzquota
1.2M    www
20K     yp

Usually, the problem is in /var/spool or /var/log, due to unread mail
or excessive log files.  You can continue to drill down by doing
a "cd <sbdirectory>" and another "du -hs *".

Routine Maintenance: Difference between revisions

Latest revision as of 11:39, 10 June 2020

Contents

Daily Tasks[edit]

check load graphs[edit]

Load averages jump at night[edit]

check backups[edit]

check bb for warnings[edit]

check jail5 for crashed VPSs[edit]

Check NetHere[edit]

Mail systems[edit]

Incoming[edit]

Outgoing[edit]

Nagios[edit]

Cacti[edit]

Monthly Tasks[edit]

rotate pine sent mail (1st of month)[edit]

b/w caps[edit]

Monthly RAID checks[edit]

Adaptec controllers[edit]

DELL (LSI-based) SAS controllers[edit]

LSI-based controllers (megaraid)[edit]

3ware[edit]

areca[edit]

Update OS list[edit]

Infrequent tasks[edit]

Free up space on gateway[edit]

Free up space on mail[edit]

Free up space on bwdb2[edit]

Free up space on backup1[edit]

Free up space on Other Servers[edit]

Navigation menu

@@ Line 1: / Line 1: @@
 = Daily Tasks =
-= Monthly Tasks =
+== check load graphs ==
-== Monthly RAID checks ==
+Click on the Load link in mgmt
+This screen shows you load levels on our servers and network traffic for critical machines (firewalls, backup servers).
+If you see load high or increasing
+FreeBSD:
+run [[VPS_Management#jtop|jtop]] (or [[VPS_Management#jt|jt]] > 7.x) and see if there are any runaway processes.  Here are some examples of entries in top that are
+definitely runaway processes:
+<pre>79481 root      64   0  2256K  1056K CPU1   1  58:16 87.40% 87.40% nano
+   1000    64   0  1852K  1112K RUN    0 207.9H 84.08% 84.08% screen
+www        2   0 39100K 31736K accept 0  104:24  46.54%  6.54% httpd
+root      61   0  1300K   844K RUN    1  47.8H 91.36% 91.36% ee
+www       56   0 18440K 10796K CPU1   0  64.4H 97.71% 97.71% httpd
+user      57   0  6124K  1160K CPU1   1  82.9H 98.44% 98.44% screen
+root      60   0  1352K   892K RUN    1  33.8H 65.82% 65.82% dialog
+   1000    64   0  3088K  2136K CPU0   0 806:13 97.95% 97.95% StutBot
+root      64   0  1396K   972K RUN    1  76.8H 86.47% 86.47% ee</pre>
+Linux:
+run [[VPS_Management#vwe|vwe]] to see which VPS’s have high loads. From there run <tt>[[VPS_Management#vp|vp]] <veid></tt> and/or <tt>[[VPS_Management#vt|vt]] <veid></tt> to see what's going on in that system.
+[[VPS_Management#vzstat|vzstat]] will also give you a nice picture of whats going on, systems with high numbers in the mlat column are likely culprits.
+examples of out of control procs:
+<pre>12183 nobody    16   0  4916 1348  1340 R    45.5  0.0  4249m httpd
+#502      16   0  1852  796   792 R    22.5  0.0  1104m vim
+#41       16   0  5472 5472  2076 R    98.9  0.2  31:41 python
+bin       19   0  1688  716   652 R    99.9  0.0 321:08 wtrs_ui
+apache    16   0   268  236   224 R    85.7  0.0  1010m ptrace
+#501      20   0  4304 2400  2044 R    53.6  0.1 284:32 YoSucker
+#506      20   0  1876  820   816 R    17.2  0.0 169:35 vim
+#514      20   0   900  724   672 R    77.6  0.0 382:30 neostats
+apache    14   0  3176 3176  1696 R    74.4  0.1   6:15 counter</pre>
+Just kill -9 them and be done with it.
+Also, anytime you see `kmod` or `ptrace` - kill those immediaely no matter how much they are using - they are attempts to exploit the linux ptrace bug.  They won't work, but they suck a lot of CPU...
+Also, any other processes that are at 90-100% cpu usage and have been running for any long period of time should be killed except for mysqld processes on FreeBSD.  See above.
+However, there is an exception:
+if it is a mysqld, we don't want to kill their database.  What you want to do is <tt>[[VPS_Management#jpid|jpid]] <pid></tt> to see who owns it, and then email them the paste containing the instructions for the nanny. Or you can simply do a <tt>kill -1 PID</tt> on the process to restart it.
+=== Load averages jump at night ===
+The load averages on the FreeBSD systems may jump up at night between 1 and 4 am - this is because the backups are running - if this is what is causing a jump in load, you will see processes like `rsync` in top eating a lot of CPU time.
+== check backups ==
+mgmt -> Motnroing -> Backups and make sure every machine was backed up the previous nite.
+Also look at df on backup1 and backup2 to make sure no disk is approaching full, though bb should warn us in advance. Please note - errors encountered when a backup script on any of the particular systems run will gnerate an email to support@johncompanies.com so you can know immediately the day after if the directory to be backed up has been moved or no longer exists. A paste exists for this to notify the customer of a non-existant file/dir.
+== check bb for warnings  ==
+mgmt -> BigBrother
+Some events don't generate pages (on purpose). You will only see them by going to the bb main page.
+== check jail5 for crashed VPSs  ==
+On jail5
+  notrunning
+To restart a VPS
+  vm restart col0XXXX
+== Check NetHere  ==
+Check the NetHere servers.  To get into the servers, login to
+admin-1.nethere.net and su - to root.
+=== Mail systems ===
+Check for possible SPAMMERs.
+==== Incoming ====
+Check mta-1 and mta-2 count of customer logins for possible
+customer SPAM compromises.
+  login_count /logs/maillog | tail -30
+==== Outgoing ====
+Check outgoing queues on relay-1 and relay-2
+  mail_count | tail -30
+To clean up outgoing queues of unwanted SPAM on relay-1 and relay-2.
+  mail_cleanup [ <sender's domain/username/message id> ... ]
+To just remove emails from some senders.
+  rmmails <sender's domain/username/message id> ...
+=== Nagios ===
+Check for other problems on NetHere servers
+  https://nagios.nethere.net
+=== Cacti ===
+Check bandwidth usage on servers
+  https://cacti.nethere.net
+= Monthly Tasks =
+== rotate pine sent mail (1st of month) ==
+On the 1st of the month, before any emails are sent out, quit out of pine, then log back in. Send mail from last month will be archived.
+If you mess up and do it on the 3rd (for example), you can go into the previous month's saved email and save emails from the current month into the <tt>sent-mail</tt> (current month) mailbox.
+== b/w caps ==
+On the 1st: remove any bwcaps put into the firewall (only really applies if a bwcap was added cause someone went over on b/w):
+ ipfw list|grep pipe
+ ipfw del [each rule listed]
+NOTE: this cronjob on newgateway will do some of that for you, provided you used one of the following pipe #s:
+0 1 * * /sbin/ipfw del 3  4 5 17331
+-------------------
+We really don’t do this anymore since we have centralized traffic accounting with netflow, but for posterity:
+Make sure all machines reset counters to 0 after midnight on the 1st
+Make sure they dumped a counter
+On each jail run:
+ trafficgather.pl
+And on each virt:
+ linuxtrafficgather.pl
+== Monthly RAID checks ==
+Every month we check the health of and verfy the parity on all our RAID-based systems.
+To facilitate this, we've created a simple script to start the process:
+ sh /root/verify.sh
+=== Adaptec controllers ===
+Here's some sample output:
+<pre>
+mail /usr/local/www/scripts# sh /root/verify.sh
+---------------------------------------------------------------------------------------------
+Adaptec SCSI RAID Controller Command Line Interface
+Copyright 1998-2002 Adaptec, Inc. All rights reserved
+---------------------------------------------------------------------------------------------
+CLI > open aac0
+Executing: open "aac0"
+AAC0> container list /f
+Executing: container list /full=TRUE
+Num          Total  Oth Chunk          Scsi   Partition
+Creation        System
+Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size   State   RO Lk Task    Done%  Ent
+Date   Time      Files
+----- ------ ------ --- ------ ------- ------ ------------- ------- -- -- ------- ------ ---
+------ -------- ------
+    Mirror 33.9GB            Open    0:01:0 64.0KB:33.9GB Normal                        0
+05:39:32
+ /dev/aacd0           mirror0          0:00:0 64.0KB:33.9GB Normal                        1
+05:39:32
+    Mirror 33.9GB            Open    0:02:0 64.0KB:33.9GB Normal                        0
+05:39:50
+ /dev/aacd1           mirror1          0:03:0 64.0KB:33.9GB Normal                        1
+05:39:50
+AAC0> disk list /f
+Executing: disk list /full=TRUE
+B:ID:L  Device Type     Removable media  Vendor-ID Product-ID        Rev   Blocks    Bytes/Bl
+ock Usage            Shared Rate
+------  --------------  ---------------  --------- ----------------  ----- --------- --------
+--- ---------------- ------ ----
+:00:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+     Initialized      NO     160
+:01:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+     Initialized      NO     160
+:02:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+     Initialized      NO     160
+:03:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+     Initialized      NO     160
-Every month we check the health of and verfy the parity on all our RAID-based systems.
+AAC0> disk show smart
-To facilitate this, we've created a simple script to start the process:
+Executing: disk show smart
-  sh /root/verify.sh
+        Smart    Method of         Enable
+        Capable  Informational     Exception  Performance  Error
-=== Adaptec controllers ===
+B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
-Here's some sample output:
+------  -------  ----------------  ---------  -----------  ------
-<pre>
+:00:0     Y            6             Y           N             0
-mail /usr/local/www/scripts# sh /root/verify.sh
+:01:0     Y            6             Y           N             0
----------------------------------------------------------------------------------------------
+:02:0     Y            6             Y           N             0
+:03:0     Y            6             Y           N             0
+:06:0     N
+AAC0> task list
+Executing: task list
-Adaptec SCSI RAID Controller Command Line Interface
+Controller Tasks
-Copyright 1998-2002 Adaptec, Inc. All rights reserved
----------------------------------------------------------------------------------------------
+TaskId Function  Done%  Container State Specific1 Specific2
+------ -------- ------- --------- ----- --------- ---------
-CLI > open aac0
+No tasks currently running on controller
-Executing: open "aac0"
-AAC0> container list /f
+AAC0> dia sh hi
-Executing: container list /full=TRUE
+Executing: diagnostic show history
-Num          Total  Oth Chunk          Scsi   Partition
+No switches specified, defaulting to "/current".
-Creation        System
-Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size   State   RO Lk Task    Done%  Ent
-Date   Time      Files
------ ------ ------ --- ------ ------- ------ ------------- ------- -- -- ------- ------ ---
------- -------- ------
-    Mirror 33.9GB            Open    0:01:0 64.0KB:33.9GB Normal                        0
-05:39:32
- /dev/aacd0           mirror0          0:00:0 64.0KB:33.9GB Normal                        1
-05:39:32
-    Mirror 33.9GB            Open    0:02:0 64.0KB:33.9GB Normal                        0
-05:39:50
- /dev/aacd1           mirror1          0:03:0 64.0KB:33.9GB Normal                        1
-05:39:50
-AAC0> disk list /f
+ *** HISTORY BUFFER FROM CURRENT CONTROLLER RUN ***
-Executing: disk list /full=TRUE
-B:ID:L  Device Type     Removable media  Vendor-ID Product-ID        Rev   Blocks    Bytes/Bl
+[00]: GetDiskLogEntry: container - 1, entry return 0
-ock Usage            Shared Rate
+[01]: Container 1 started SCRUB task
-------  --------------  ---------------  --------- ----------------  ----- --------- --------
+[02]: Starting Mirror:1 scrub
---- ---------------- ------ ----
+[03]: Master disk: 2, start sector: 128, sector count = 71286784
-:00:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+[04]: Slave  disk: 3, start sector: 128, sector count = 71286784
-     Initialized      NO     160
+[05]: UpdateDiskLogIndex - Set   - container 0, index 1
-:01:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+[06]: GetDiskLogEntry: container - 0, entry return 1
-     Initialized      NO     160
+[07]: Container 0 started SCRUB task
-:02:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+[08]: Starting Mirror:0 scrub
-     Initialized      NO     160
+[09]: Master disk: 1, start sector: 128, sector count = 71286784
-:03:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
+[10]: Slave  disk: 0, start sector: 128, sector count = 71286784
-     Initialized      NO     160
+[11]: Mirror Scrub Container:1   ErrorsFound:0
+[12]: Clear disk log: sector - 80, driveno 2
-AAC0> disk show smart
+[13]: Clear disk log: sector - 80, driveno 3
-Executing: disk show smart
+[14]: Container 1 completed SCRUB task:
+[15]: Mirror Scrub Container:0   ErrorsFound:0
-        Smart    Method of         Enable
+[16]: Clear disk log: sector - 81, driveno 1
-        Capable  Informational     Exception  Performance  Error
+[17]: Clear disk log: sector - 81, driveno 0
-B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
+[18]: Container 0 completed SCRUB task:
-------  -------  ----------------  ---------  -----------  ------
+[19]: UpdateDiskLogIndex - Set   - container 0, index 0
-:00:0     Y            6             Y           N             0
+[20]: GetDiskLogEntry: container - 0, entry return 0
-:01:0     Y            6             Y           N             0
+[21]: Container 0 started SCRUB task
-:02:0     Y            6             Y           N             0
+[22]: Starting Mirror:0 scrub
-:03:0     Y            6             Y           N             0
+[23]: Master disk: 1, start sector: 128, sector count = 71286784
-:06:0     N
+[24]: Slave  disk: 0, start sector: 128, sector count = 71286784
+[25]: UpdateDiskLogIndex - Set   - container 1, index 1
-AAC0> task list
+[26]: GetDiskLogEntry: container - 1, entry return 1
-Executing: task list
+[27]: Container 1 started SCRUB task
+[28]: Starting Mirror:1 scrub
-Controller Tasks
+[29]: Master disk: 2, start sector: 128, sector count = 71286784
+[30]: Slave  disk: 3, start sector: 128, sector count = 71286784
-TaskId Function  Done%  Container State Specific1 Specific2
+[31]: Mirror Scrub Container:1   ErrorsFound:0
------- -------- ------- --------- ----- --------- ---------
+[32]: Clear disk log: sector - 81, driveno 2
+[33]: Clear disk log: sector - 81, driveno 3
-No tasks currently running on controller
+[34]: Container 1 completed SCRUB task:
+[35]: Mirror Scrub Container:0   ErrorsFound:0
-AAC0> dia sh hi
+[36]: Clear disk log: sector - 80, driveno 1
-Executing: diagnostic show history
+[37]: Clear disk log: sector - 80, driveno 0
-No switches specified, defaulting to "/current".
+[38]: Container 0 completed SCRUB task:
+[39]: UpdateDiskLogIndex - Set   - container 0, index 0
+[40]: GetDiskLogEntry: container - 0, entry return 0
+[41]: Container 0 started SCRUB task
- *** HISTORY BUFFER FROM CURRENT CONTROLLER RUN ***
+[42]: Starting Mirror:0 scrub
+[43]: Master disk: 1, start sector: 128, sector count = 71286784
-[00]: GetDiskLogEntry: container - 1, entry return 0
+[44]: Slave  disk: 0, start sector: 128, sector count = 71286784
-[01]: Container 1 started SCRUB task
+[45]: UpdateDiskLogIndex - Set   - container 1, index 1
-[02]: Starting Mirror:1 scrub
+[46]: GetDiskLogEntry: container - 1, entry return 1
-[03]: Master disk: 2, start sector: 128, sector count = 71286784
+[47]: Container 1 started SCRUB task
-[04]: Slave  disk: 3, start sector: 128, sector count = 71286784
+[48]: Starting Mirror:1 scrub
-[05]: UpdateDiskLogIndex - Set   - container 0, index 1
+[49]: Master disk: 2, start sector: 128, sector count = 71286784
-[06]: GetDiskLogEntry: container - 0, entry return 1
+[50]: Slave  disk: 3, start sector: 128, sector count = 71286784
-[07]: Container 0 started SCRUB task
+[51]: Mirror Scrub Container:1   ErrorsFound:0
-[08]: Starting Mirror:0 scrub
+[52]: Clear disk log: sector - 81, driveno 2
-[09]: Master disk: 1, start sector: 128, sector count = 71286784
+[53]: Clear disk log: sector - 81, driveno 3
-[10]: Slave  disk: 0, start sector: 128, sector count = 71286784
+[54]: Container 1 completed SCRUB task:
-[11]: Mirror Scrub Container:1   ErrorsFound:0
+[55]: Mirror Scrub Container:0   ErrorsFound:0
-[12]: Clear disk log: sector - 80, driveno 2
+[56]: Clear disk log: sector - 80, driveno 1
-[13]: Clear disk log: sector - 80, driveno 3
+[57]: Clear disk log: sector - 80, driveno 0
-[14]: Container 1 completed SCRUB task:
+[58]: Container 0 completed SCRUB task:
-[15]: Mirror Scrub Container:0   ErrorsFound:0
+[59]: UpdateDiskLogIndex - Set   - container 0, index 0
-[16]: Clear disk log: sector - 81, driveno 1
+[60]: GetDiskLogEntry: container - 0, entry return 0
-[17]: Clear disk log: sector - 81, driveno 0
+[61]: Container 0 started SCRUB task
-[18]: Container 0 completed SCRUB task:
+[62]: Starting Mirror:0 scrub
-[19]: UpdateDiskLogIndex - Set   - container 0, index 0
+[63]: Master disk: 1, start sector: 128, sector count = 71286784
-[20]: GetDiskLogEntry: container - 0, entry return 0
+[64]: Slave  disk: 0, start sector: 128, sector count = 71286784
-[21]: Container 0 started SCRUB task
+[65]: UpdateDiskLogIndex - Set   - container 1, index 1
-[22]: Starting Mirror:0 scrub
+[66]: GetDiskLogEntry: container - 1, entry return 1
-[23]: Master disk: 1, start sector: 128, sector count = 71286784
+[67]: Container 1 started SCRUB task
-[24]: Slave  disk: 0, start sector: 128, sector count = 71286784
+[68]: Starting Mirror:1 scrub
-[25]: UpdateDiskLogIndex - Set   - container 1, index 1
+[69]: Master disk: 2, start sector: 128, sector count = 71286784
-[26]: GetDiskLogEntry: container - 1, entry return 1
+[70]: Slave  disk: 3, start sector: 128, sector count = 71286784
-[27]: Container 1 started SCRUB task
+[71]: Mirror Scrub Container:1   ErrorsFound:0
-[28]: Starting Mirror:1 scrub
+[72]: Clear disk log: sector - 81, driveno 2
-[29]: Master disk: 2, start sector: 128, sector count = 71286784
+[73]: Clear disk log: sector - 81, driveno 3
-[30]: Slave  disk: 3, start sector: 128, sector count = 71286784
+[74]: Container 1 completed SCRUB task:
-[31]: Mirror Scrub Container:1   ErrorsFound:0
+[75]: Mirror Scrub Container:0   ErrorsFound:0
-[32]: Clear disk log: sector - 81, driveno 2
+[76]: Clear disk log: sector - 80, driveno 1
-[33]: Clear disk log: sector - 81, driveno 3
+[77]: Clear disk log: sector - 80, driveno 0
-[34]: Container 1 completed SCRUB task:
+[78]: Container 0 completed SCRUB task:
-[35]: Mirror Scrub Container:0   ErrorsFound:0
+[79]: UpdateDiskLogIndex - Set   - container 0, index 0
-[36]: Clear disk log: sector - 80, driveno 1
+[80]: GetDiskLogEntry: container - 0, entry return 0
-[37]: Clear disk log: sector - 80, driveno 0
+[81]: Container 0 started SCRUB task
-[38]: Container 0 completed SCRUB task:
+[82]: Starting Mirror:0 scrub
-[39]: UpdateDiskLogIndex - Set   - container 0, index 0
+[83]: Master disk: 1, start sector: 128, sector count = 71286784
-[40]: GetDiskLogEntry: container - 0, entry return 0
+[84]: Slave  disk: 0, start sector: 128, sector count = 71286784
-[41]: Container 0 started SCRUB task
+[85]: UpdateDiskLogIndex - Set   - container 1, index 1
-[42]: Starting Mirror:0 scrub
+[86]: GetDiskLogEntry: container - 1, entry return 1
-[43]: Master disk: 1, start sector: 128, sector count = 71286784
+[87]: Container 1 started SCRUB task
-[44]: Slave  disk: 0, start sector: 128, sector count = 71286784
+[88]: Starting Mirror:1 scrub
-[45]: UpdateDiskLogIndex - Set   - container 1, index 1
+[89]: Master disk: 2, start sector: 128, sector count = 71286784
-[46]: GetDiskLogEntry: container - 1, entry return 1
+[90]: Slave  disk: 3, start sector: 128, sector count = 71286784
-[47]: Container 1 started SCRUB task
+[91]: Mirror Scrub Container:1   ErrorsFound:0
-[48]: Starting Mirror:1 scrub
+[92]: Clear disk log: sector - 81, driveno 2
-[49]: Master disk: 2, start sector: 128, sector count = 71286784
+[93]: Clear disk log: sector - 81, driveno 3
-[50]: Slave  disk: 3, start sector: 128, sector count = 71286784
+[94]: Container 1 completed SCRUB task:
-[51]: Mirror Scrub Container:1   ErrorsFound:0
+[95]: Mirror Scrub Container:0   ErrorsFound:0
-[52]: Clear disk log: sector - 81, driveno 2
+[96]: Clear disk log: sector - 80, driveno 1
-[53]: Clear disk log: sector - 81, driveno 3
+[97]: Clear disk log: sector - 80, driveno 0
-[54]: Container 1 completed SCRUB task:
+[98]: Container 0 completed SCRUB task:
-[55]: Mirror Scrub Container:0   ErrorsFound:0
+[99]:
-[56]: Clear disk log: sector - 80, driveno 1
-[57]: Clear disk log: sector - 80, driveno 0
+========================
-[58]: Container 0 completed SCRUB task:
+History Output Complete.
-[59]: UpdateDiskLogIndex - Set   - container 0, index 0
-[60]: GetDiskLogEntry: container - 0, entry return 0
+AAC0>
-[61]: Container 0 started SCRUB task
+AAC0> exit
-[62]: Starting Mirror:0 scrub
+Executing: exit
-[63]: Master disk: 1, start sector: 128, sector count = 71286784
-[64]: Slave  disk: 0, start sector: 128, sector count = 71286784
+press enter when ready to run verify                                                 <INS>
-[65]: UpdateDiskLogIndex - Set   - container 1, index 1
+---------------------------------------------------------------------------------------------
-[66]: GetDiskLogEntry: container - 1, entry return 1
-[67]: Container 1 started SCRUB task
+Adaptec SCSI RAID Controller Command Line Interface
-[68]: Starting Mirror:1 scrub
+Copyright 1998-2002 Adaptec, Inc. All rights reserved
-[69]: Master disk: 2, start sector: 128, sector count = 71286784
+---------------------------------------------------------------------------------------------
-[70]: Slave  disk: 3, start sector: 128, sector count = 71286784
-[71]: Mirror Scrub Container:1   ErrorsFound:0
-[72]: Clear disk log: sector - 81, driveno 2
+CLI > open aac0
-[73]: Clear disk log: sector - 81, driveno 3
+Executing: open "aac0"
-[74]: Container 1 completed SCRUB task:
-[75]: Mirror Scrub Container:0   ErrorsFound:0
+AAC0> contai scr 0
-[76]: Clear disk log: sector - 80, driveno 1
+Executing: container scrub 0
-[77]: Clear disk log: sector - 80, driveno 0
-[78]: Container 0 completed SCRUB task:
+AAC0> contai scr 1
-[79]: UpdateDiskLogIndex - Set   - container 0, index 0
+Executing: container scrub 1
-[80]: GetDiskLogEntry: container - 0, entry return 0
-[81]: Container 0 started SCRUB task
-[82]: Starting Mirror:0 scrub
-[83]: Master disk: 1, start sector: 128, sector count = 71286784
-[84]: Slave  disk: 0, start sector: 128, sector count = 71286784
-[85]: UpdateDiskLogIndex - Set   - container 1, index 1
-[86]: GetDiskLogEntry: container - 1, entry return 1
-[87]: Container 1 started SCRUB task
-[88]: Starting Mirror:1 scrub
-[89]: Master disk: 2, start sector: 128, sector count = 71286784
-[90]: Slave  disk: 3, start sector: 128, sector count = 71286784
-[91]: Mirror Scrub Container:1   ErrorsFound:0
-[92]: Clear disk log: sector - 81, driveno 2
-[93]: Clear disk log: sector - 81, driveno 3
-[94]: Container 1 completed SCRUB task:
-[95]: Mirror Scrub Container:0   ErrorsFound:0
-[96]: Clear disk log: sector - 80, driveno 1
-[97]: Clear disk log: sector - 80, driveno 0
-[98]: Container 0 completed SCRUB task:
-[99]:
-========================
+AAC0> exit
-History Output Complete.
+Executing: exit
-AAC0>
-AAC0> exit
-Executing: exit
-press enter when ready to run verify                                                 <INS>
----------------------------------------------------------------------------------------------
-Adaptec SCSI RAID Controller Command Line Interface
-Copyright 1998-2002 Adaptec, Inc. All rights reserved
----------------------------------------------------------------------------------------------
-CLI > open aac0
-Executing: open "aac0"
-AAC0> contai scr 0
-Executing: container scrub 0
-AAC0> contai scr 1
-Executing: container scrub 1
-AAC0> exit
-Executing: exit
 when done run:
@@ Line 1,265: / Line 1,403: @@
 [root@newbackup3 ~]#
 </pre>
+== Update OS list ==
+# check for any new VZ templates we want to offer: <tt>vzup2date -z</tt>
+# see if there's any OS's we want to include in our colo install list. Update 2 places: <tt>signup/html/colo_quote.html</tt> & <tt>signup/html/step1.html</tt>
+# update the mgmt database (ref_templates table, ref_systems table).
 = Infrequent tasks =
+== Free up space on gateway ==
+<pre>newgateway /var/spool# cd clientmqueue/
+newgateway /var/spool/clientmqueue# sh
+# for f in `ls`; do rm $f; done
+exit</pre>
 == Free up space on mail ==
@@ Line 1,315: / Line 1,467: @@
 == Free up space on bwdb2 ==
-You can either remove items from <tt>/usr/home/archive</tt> or you can scp them to backup2 @ castle.
+You can either remove items from <tt>/usr/home/archive</tt> or you can scp them to backup3:/data/bwdb2/archive .
 == Free up space on backup1 ==
@@ Line 1,371: / Line 1,523: @@
 </pre>
-Every few months you will also want to remove some of the snapshot archives for mail. To do this you set aside the dates you want to save then remove months at a time, followed by restoring the set aside dates. Here's how that works:
+Every few months you will also want to remove some of the snapshot archives for mail.  We typically save the 1st, 10th, and 20th of each month.  To do this you set aside the dates you want to save then remove months at a time, followed by restoring the set aside dates. Here's how that works:
 <pre>[root@backup1 /data/www/daily]# ls
                      08-10-11  10-04-10  11-10-10  12-07-29  12-09-21  12-11-14
@@ Line 1,410: / Line 1,562: @@
 -03-01               09-10-01  11-04-01  12-07-10  12-09-02  12-10-26  12-12-19
 -03-10               09-10-10  11-04-10  12-07-11  12-09-03  12-10-27  12-12-20
 -03-20               09-10-20  11-04-20  12-07-12  12-09-04  12-10-28  12-12-21
 -04-01               09-11-01  11-05-01  12-07-13  12-09-05  12-10-29  12-12-22
 -04-20               09-11-10  11-05-10  12-07-14  12-09-06  12-10-30  12-12-23
 -05-01               09-11-20  11-05-20  12-07-15  12-09-07  12-10-31  12-12-24
 -05-10               09-12-01  11-06-01  12-07-16  12-09-08  12-11-01  12-12-25
 -06-10               09-12-10  11-06-10  12-07-17  12-09-09  12-11-02  12-12-26
 -06-20               09-12-20  11-06-20  12-07-18  12-09-10  12-11-03  12-12-27
 -07-02               10-01-01  11-07-01  12-07-19  12-09-11  12-11-04  12-12-28
 -07-10               10-01-10  11-07-10  12-07-20  12-09-12  12-11-05  2008-10-23
 -07-20               10-01-20  11-07-20  12-07-21  12-09-13  12-11-06  bb.tgz
 -08-01               10-02-01  11-08-01  12-07-22  12-09-14  12-11-07  boot
 -08-10               10-02-10  11-08-10  12-07-23  12-09-15  12-11-08  current
 -08-21               10-02-20  11-08-20  12-07-24  12-09-16  12-11-09  hold
 -09-01               10-03-01  11-09-01  12-07-25  12-09-17  12-11-10
 -09-10               10-03-10  11-09-10  12-07-26  12-09-18  12-11-11
 -09-21               10-03-20  11-09-20  12-07-27  12-09-19  12-11-12
 -10-01               10-04-01  11-10-01  12-07-28  12-09-20  12-11-13
 [root@backup1 /data/www/daily]#
+</pre>
+So we see that everything up to July 2012 has been pruned. To prune July 2012 we do the following:
+<pre>mv 12-07-01 hold
+mv 12-07-10 hold
+mv 12-07-20 hold
+rm -fr 12-07*
+mv hold/* .</pre>
+== Free up space on Other Servers ==
+Many servers start to run out of disk space over time.  Often it is caused by unread mail for
+root or log files.
+To find the source of the problems, you use "du" to find where the disk space is being used.
+You can't do a du on /proc or /dev, so I use the command
+<pre>
+[root@virt11 /]# du -hs [a-c]* deprecated [e-o]* [q-u]* var | tee duhs0
+which produces something like this.
+.0K    backup
+.0K    backup1
+.0K    backup2
+.0K    backup3
+.0K    backup4
+.5M    bin
+M     boot
+.0K    deprecated
+M     etc
+M     home
+.0K    initrd
+M    lib
+K     lost+found
+.0K    media
+       misc
+.0K    mnt
+       net
+M     opt
+K    root
+M     sbin
+.0K    selinux
+.0K    srv
+       sys
+.0K    test
+K     tmp
+.2G    usr
+M    var
+In this case it looks like /var is the problem, so
+cd /var
+du -hs * | tee duhs9
+Produces
+K     account
+.6M    analog-5.32
+M     cache
+K     db
+.0K    duhs
+.0K    duhs1
+.0K    duhs2
+.0K    duhs3
+.0K    duhs4
+.0K    duhs5
+.0K    duhs6
+.0K    duhs7
+.0K    duhs8
+K     empty
+.0K    games
+K     kerberos
+M     lib
+.0K    local
+K     lock
+M    log
+       mail
+.0K    nis
+.0K    opt
+.0K    preserve
+.0K    racoon
+K    run
+M     spool
+.0K    tmp
+K     vz
+       vzagent
+       vzagent.tmp
+K     vzquota
+.2M    www
+K     yp
+Usually, the problem is in /var/spool or /var/log, due to unread mail
+or excessive log files.  You can continue to drill down by doing
+a "cd <sbdirectory>" and another "du -hs *".
 </pre>
-So we see that everything up to July 2012 has been pruned. To prune July 2012 we do the following:
-<pre>mv 12-07-01 hold
-mv 12-07-10 hold
-mv 12-07-20 hold
-rm -fr 12-07*
-mv hold/* .</pre>

Routine Maintenance: Difference between revisions

Latest revision as of 11:39, 10 June 2020

Daily Tasks[edit]

check load graphs[edit]

Load averages jump at night[edit]

check backups[edit]

check bb for warnings[edit]

check jail5 for crashed VPSs[edit]

Check NetHere[edit]

Mail systems[edit]

Incoming[edit]

Outgoing[edit]

Nagios[edit]

Cacti[edit]

Monthly Tasks[edit]

rotate pine sent mail (1st of month)[edit]

b/w caps[edit]

Monthly RAID checks[edit]

Adaptec controllers[edit]

DELL (LSI-based) SAS controllers[edit]

LSI-based controllers (megaraid)[edit]

3ware[edit]

areca[edit]

Update OS list[edit]

Infrequent tasks[edit]

Free up space on gateway[edit]

Free up space on mail[edit]

Free up space on bwdb2[edit]

Free up space on backup1[edit]

Free up space on Other Servers[edit]

Navigation menu

Search