Routine Maintenance

From JCWiki

Free up space on backup1

backup1 is our primary customer backup system. As usage grows, it needs to be purged of old files regularly. The easiest way to do this is to remove deprecated files, which are mostly the files of cancelled customers or temporary dump/storage files (created during dump/restores). Our standard policy is to keep a cancelled customer's files for 6 months, after which we remove them. (As far as customers know, their data is purged immediately, but we hang onto it just in case; in some cases we cancel a server for non-payment, and this makes it easy to restore their system.) To find files to remove:

[root@backup1 ~]# cd /data/deprecated/
[root@backup1 /data/deprecated]# ls
2101-migrated-20120317.tgz                old-683-cxld-20121021.tgz
69.55.230.2-wwwbackup                     old-744-cxld-20120708.tgz
991-DONTDELETE.tgz                        old-809-cxld-20120609.tgz
archive-col02050-mdfile-cxld-20120409.gz  old-854-cxld-20120621.tgz
col01371.tgz                              old-931-cxld-20060513.tgz
deleteme_ubuntu-10.10-x86_20111205        old-col00123-mdfile-noarchive-20120417.gz
jail10_old                                old-col00147-vnfile-cxld-20120828.gz
jail14_rsync_old                          old-col00419-dump-cxld-20120224.gz
jail15_old                                old-col01098-vnfile-cxld-20120827.gz
jail3_old                                 old-col01278-dump-cxld-20120822
jail4_old                                 old-col01517-dump-cxld-20120828
jail5_old                                 old-col01669-dump-cxld-20120203.gz
old-1009-cxld-20120608.tgz                old-col01687-dump-cxld-20120909
old-1012-cxld-20120411.tgz                old-col01790-dump-cxld-20120828
old-1052-cxld-20120721.tgz                old-col01812-dump-cxld-20120820
old-10631-cxld-20120622.tgz               old-col01938-mdfile-cxld-20120619.gz
old-10632-cxld-20120622.tgz               old-col02095-mdfile-noarchive-20120523.gz
old-10633-cxld-20120622.tgz               olddebian-3.0-v15-20110610.tgz
old-1236-cxld-20120621.tgz                oldmod_frontpage-deb30-v15-20110610.tgz
old-1381-cxld-20120404.tgz                oldmod_perl-deb30-v15-20110610.tgz
old-1422-cxld-20120721.tgz                oldmod_ssl-deb30-v15-20110610.tgz
old-14681-cxld-20120619.tgz               oldmysql-deb30-v15-20110610.tgz
old-1544-cxld-20120626.tgz                oldproftpd-deb30-v15-20110610.tgz
old-18351-cxld-20120605.tgz               old_virt14
old-1853-cxld-20120910.tgz                old_virt18
old-1963-cxld-20120206.tgz                oldwebmin-deb30-v15-20110610.tgz
old-1967-cxld-20120605.tgz                suse.virt11.20120421.tgz
old-1981-noarchive-20120729.tgz           virt11
old-2030-migrated-noarchive-20120727.tgz  virt12_old
old-2037-cxld-20120716.tgz                virt13_old
old-2065-cxld-20120727.tgz                virt16_old
old-2068-cxld-20120424.tgz                virt4_old
old-2085-cxld-20120531.tgz                virt5_old
old-364-cxld-20120904.tgz                 virt6_old
old-446-cxld-20120512.tgz                 virt7_old
old-613-cxld-20120601.tgz                 virt8_old
[root@backup1 /data/deprecated]#

virtX_old and jailX_old are permanently archived, so ignore those, along with anything marked not to delete or that otherwise looks suspicious. Likewise, it's probably a good idea to hang onto oldTEMPLATE.gz for as long as we can. Most of what we want to delete is dated with when it was deprecated, which makes this easy. So, to remove files from 6 months ago (running this in October):

[root@backup1 /data/deprecated]# ls old*201204*
old-1012-cxld-20120411.tgz  old-2068-cxld-20120424.tgz
old-1381-cxld-20120404.tgz  old-col00123-mdfile-noarchive-20120417.gz
[root@backup1 /data/deprecated]# rm old*201204*
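
If you'd rather select by date than type a glob for each month, here's a minimal sketch. It assumes the last dash-separated field of the name (before any .tgz/.gz) is the deprecation date as YYYYMMDD, as in the listing above. The function name list_expired and its cutoff argument are made up for illustration. It only prints candidates; it never deletes anything.

```shell
# list_expired CUTOFF NAME...
# Print every NAME that looks like a dated "old-..." file whose embedded
# date (YYYYMMDD, assumed to be the last dash-separated field) is earlier
# than CUTOFF. Nothing is deleted.
list_expired() {
    cutoff=$1
    shift
    for name in "$@"; do
        case $name in
            *DONTDELETE*) continue ;;  # explicitly marked to keep
            old-*) ;;                  # dated cancellations, dumps, mdfiles
            *) continue ;;             # virtX_old, jailX_old, templates, etc.
        esac
        base=${name%.tgz}              # drop a trailing .tgz, if present
        base=${base%.gz}               # drop a trailing .gz, if present
        stamp=${base##*-}              # text after the last dash
        case $stamp in
            [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
            *) continue ;;             # no usable date, leave it alone
        esac
        if [ "$stamp" -lt "$cutoff" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

For example, running "cd /data/deprecated && list_expired 20120501 *" in October would print everything dated before May. Note this also picks up files much older than 6 months, so read the list over before handing any of it to rm.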

Monthly RAID checks

Every month we check the health of, and verify the parity on, all our RAID-based systems. To facilitate this, we've created a simple script to start the process:

sh /root/verify.sh

Adaptec-based servers

Here's some sample output:

mail /usr/local/www/scripts# sh /root/verify.sh
---------------------------------------------------------------------------------------------

Adaptec SCSI RAID Controller Command Line Interface
Copyright 1998-2002 Adaptec, Inc. All rights reserved
---------------------------------------------------------------------------------------------


CLI > open aac0
Executing: open "aac0"

AAC0> container list /f
Executing: container list /full=TRUE
Num          Total  Oth Chunk          Scsi   Partition
Creation        System
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size   State   RO Lk Task    Done%  Ent
Date   Time      Files
----- ------ ------ --- ------ ------- ------ ------------- ------- -- -- ------- ------ ---
------ -------- ------
 0    Mirror 33.9GB            Open    0:01:0 64.0KB:33.9GB Normal                        0
071002 05:39:32
 /dev/aacd0           mirror0          0:00:0 64.0KB:33.9GB Normal                        1
071002 05:39:32

 1    Mirror 33.9GB            Open    0:02:0 64.0KB:33.9GB Normal                        0
071002 05:39:50
 /dev/aacd1           mirror1          0:03:0 64.0KB:33.9GB Normal                        1
071002 05:39:50


AAC0> disk list /f
Executing: disk list /full=TRUE

B:ID:L  Device Type     Removable media  Vendor-ID Product-ID        Rev   Blocks    Bytes/Bl
ock Usage            Shared Rate
------  --------------  ---------------  --------- ----------------  ----- --------- --------
--- ---------------- ------ ----
0:00:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160
0:01:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160
0:02:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160
0:03:0   Disk            N                FUJITSU   MAJ3364MC         3702  71390320  512
     Initialized      NO     160

AAC0> disk show smart
Executing: disk show smart

        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:00:0     Y            6             Y           N             0
0:01:0     Y            6             Y           N             0
0:02:0     Y            6             Y           N             0
0:03:0     Y            6             Y           N             0
0:06:0     N

AAC0> task list
Executing: task list

Controller Tasks

TaskId Function  Done%  Container State Specific1 Specific2
------ -------- ------- --------- ----- --------- ---------

No tasks currently running on controller

AAC0> dia sh hi
Executing: diagnostic show history
No switches specified, defaulting to "/current".



 *** HISTORY BUFFER FROM CURRENT CONTROLLER RUN ***

[00]: GetDiskLogEntry: container - 1, entry return 0
[01]: Container 1 started SCRUB task
[02]: Starting Mirror:1 scrub
[03]: Master disk: 2, start sector: 128, sector count = 71286784
[04]: Slave  disk: 3, start sector: 128, sector count = 71286784
[05]: UpdateDiskLogIndex - Set   - container 0, index 1
[06]: GetDiskLogEntry: container - 0, entry return 1
[07]: Container 0 started SCRUB task
[08]: Starting Mirror:0 scrub
[09]: Master disk: 1, start sector: 128, sector count = 71286784
[10]: Slave  disk: 0, start sector: 128, sector count = 71286784
[11]: Mirror Scrub Container:1   ErrorsFound:0
[12]: Clear disk log: sector - 80, driveno 2
[13]: Clear disk log: sector - 80, driveno 3
[14]: Container 1 completed SCRUB task:
[15]: Mirror Scrub Container:0   ErrorsFound:0
[16]: Clear disk log: sector - 81, driveno 1
[17]: Clear disk log: sector - 81, driveno 0
[18]: Container 0 completed SCRUB task:
[19]: UpdateDiskLogIndex - Set   - container 0, index 0
[20]: GetDiskLogEntry: container - 0, entry return 0
[21]: Container 0 started SCRUB task
[22]: Starting Mirror:0 scrub
[23]: Master disk: 1, start sector: 128, sector count = 71286784
[24]: Slave  disk: 0, start sector: 128, sector count = 71286784
[25]: UpdateDiskLogIndex - Set   - container 1, index 1
[26]: GetDiskLogEntry: container - 1, entry return 1
[27]: Container 1 started SCRUB task
[28]: Starting Mirror:1 scrub
[29]: Master disk: 2, start sector: 128, sector count = 71286784
[30]: Slave  disk: 3, start sector: 128, sector count = 71286784
[31]: Mirror Scrub Container:1   ErrorsFound:0
[32]: Clear disk log: sector - 81, driveno 2
[33]: Clear disk log: sector - 81, driveno 3
[34]: Container 1 completed SCRUB task:
[35]: Mirror Scrub Container:0   ErrorsFound:0
[36]: Clear disk log: sector - 80, driveno 1
[37]: Clear disk log: sector - 80, driveno 0
[38]: Container 0 completed SCRUB task:
[39]: UpdateDiskLogIndex - Set   - container 0, index 0
[40]: GetDiskLogEntry: container - 0, entry return 0
[41]: Container 0 started SCRUB task
[42]: Starting Mirror:0 scrub
[43]: Master disk: 1, start sector: 128, sector count = 71286784
[44]: Slave  disk: 0, start sector: 128, sector count = 71286784
[45]: UpdateDiskLogIndex - Set   - container 1, index 1
[46]: GetDiskLogEntry: container - 1, entry return 1
[47]: Container 1 started SCRUB task
[48]: Starting Mirror:1 scrub
[49]: Master disk: 2, start sector: 128, sector count = 71286784
[50]: Slave  disk: 3, start sector: 128, sector count = 71286784
[51]: Mirror Scrub Container:1   ErrorsFound:0
[52]: Clear disk log: sector - 81, driveno 2
[53]: Clear disk log: sector - 81, driveno 3
[54]: Container 1 completed SCRUB task:
[55]: Mirror Scrub Container:0   ErrorsFound:0
[56]: Clear disk log: sector - 80, driveno 1
[57]: Clear disk log: sector - 80, driveno 0
[58]: Container 0 completed SCRUB task:
[59]: UpdateDiskLogIndex - Set   - container 0, index 0
[60]: GetDiskLogEntry: container - 0, entry return 0
[61]: Container 0 started SCRUB task
[62]: Starting Mirror:0 scrub
[63]: Master disk: 1, start sector: 128, sector count = 71286784
[64]: Slave  disk: 0, start sector: 128, sector count = 71286784
[65]: UpdateDiskLogIndex - Set   - container 1, index 1
[66]: GetDiskLogEntry: container - 1, entry return 1
[67]: Container 1 started SCRUB task
[68]: Starting Mirror:1 scrub
[69]: Master disk: 2, start sector: 128, sector count = 71286784
[70]: Slave  disk: 3, start sector: 128, sector count = 71286784
[71]: Mirror Scrub Container:1   ErrorsFound:0
[72]: Clear disk log: sector - 81, driveno 2
[73]: Clear disk log: sector - 81, driveno 3
[74]: Container 1 completed SCRUB task:
[75]: Mirror Scrub Container:0   ErrorsFound:0
[76]: Clear disk log: sector - 80, driveno 1
[77]: Clear disk log: sector - 80, driveno 0
[78]: Container 0 completed SCRUB task:
[79]: UpdateDiskLogIndex - Set   - container 0, index 0
[80]: GetDiskLogEntry: container - 0, entry return 0
[81]: Container 0 started SCRUB task
[82]: Starting Mirror:0 scrub
[83]: Master disk: 1, start sector: 128, sector count = 71286784
[84]: Slave  disk: 0, start sector: 128, sector count = 71286784
[85]: UpdateDiskLogIndex - Set   - container 1, index 1
[86]: GetDiskLogEntry: container - 1, entry return 1
[87]: Container 1 started SCRUB task
[88]: Starting Mirror:1 scrub
[89]: Master disk: 2, start sector: 128, sector count = 71286784
[90]: Slave  disk: 3, start sector: 128, sector count = 71286784
[91]: Mirror Scrub Container:1   ErrorsFound:0
[92]: Clear disk log: sector - 81, driveno 2
[93]: Clear disk log: sector - 81, driveno 3
[94]: Container 1 completed SCRUB task:
[95]: Mirror Scrub Container:0   ErrorsFound:0
[96]: Clear disk log: sector - 80, driveno 1
[97]: Clear disk log: sector - 80, driveno 0
[98]: Container 0 completed SCRUB task:
[99]:

========================
History Output Complete.

AAC0>
AAC0> exit
Executing: exit

press enter when ready to run verify
---------------------------------------------------------------------------------------------

Adaptec SCSI RAID Controller Command Line Interface
Copyright 1998-2002 Adaptec, Inc. All rights reserved
---------------------------------------------------------------------------------------------


CLI > open aac0
Executing: open "aac0"

AAC0> contai scr 0
Executing: container scrub 0

AAC0> contai scr 1
Executing: container scrub 1

AAC0> exit
Executing: exit

when done run:                                                                       

aaccli
open aac0
dia sh hi
c


Nov  1 10:32:46 mail /kernel: aac0: **Monitor** Container 0 started SCRUB task
Nov  1 10:32:47 mail /kernel: aac0: **Monitor** Container 1 started SCRUB task

Here's an analysis of what we're seeing and what we're looking for:

AAC0> container list /f
Executing: container list /full=TRUE
Num          Total  Oth Chunk          Scsi   Partition
Creation        System
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size   State   RO Lk Task    Done%  Ent
Date   Time      Files
----- ------ ------ --- ------ ------- ------ ------------- ------- -- -- ------- ------ ---
------ -------- ------
 0    Mirror 33.9GB            Open    0:01:0 64.0KB:33.9GB Normal                        0
071002 05:39:32
 /dev/aacd0           mirror0          0:00:0 64.0KB:33.9GB Normal                        1
071002 05:39:32

 1    Mirror 33.9GB            Open    0:02:0 64.0KB:33.9GB Normal                        0
071002 05:39:50
 /dev/aacd1           mirror1          0:03:0 64.0KB:33.9GB Normal                        1
071002 05:39:50

This shows the health of the arrays. You're looking for Normal in the State column and the absence of a ! in the Partition Offset:Size column. Sometimes you'll see this:

64.0KB!33.9GB 

That indicates a problem.
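
If you have the container list output saved to a file, the same eyeball check can be sketched with awk. This is only an illustration: the function name check_containers is made up, and it assumes partition rows always have a B:ID:L field followed directly by the Offset:Size field and then the State, as in the sample above.

```shell
# check_containers: read "container list /f" output on stdin and print any
# partition row whose State is not "Normal" or whose Offset:Size field uses
# "!" instead of ":". Exit status is 1 if any row was flagged, else 0.
check_containers() {
    awk '
        BEGIN { bad = 0 }
        {
            for (i = 1; i < NF; i++) {
                # A B:ID:L field followed by an Offset:Size field marks a partition row.
                if ($i ~ /^[0-9]+:[0-9]+:[0-9]+$/ &&
                    $(i + 1) ~ /^[0-9.]+[KMGT]B[:!][0-9.]+[KMGT]B$/) {
                    if ($(i + 1) ~ /!/ || $(i + 2) != "Normal") {
                        print "CHECK: " $0
                        bad = 1
                    }
                    break
                }
            }
        }
        END { exit bad }
    '
}
```

Usage would be something like "check_containers < saved-output.txt" (hypothetical file name). Disk list and SMART rows are ignored because the field after their B:ID:L is not an Offset:Size pair.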

AAC0> disk show smart
Executing: disk show smart

        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:00:0     Y            6             Y           N             0
0:01:0     Y            6             Y           N             0
0:02:0     Y            6             Y           N             0
0:03:0     Y            6             Y           N             0
0:06:0     N

This is the SMART report. You're looking for non-zero values in the Error Count column.
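
As a rough sketch, the Error Count check can also be done over saved output with awk. The function name smart_errors is made up, and it assumes SMART-capable rows always have exactly the six columns shown above, with Error Count last.

```shell
# smart_errors: read "disk show smart" output on stdin and report any
# SMART-capable device whose Error Count (6th column) is greater than 0.
# Exit status is 1 if any device was reported, else 0.
smart_errors() {
    awk '
        BEGIN { bad = 0 }
        # Rows look like: 0:02:0  Y  6  Y  N  0
        $1 ~ /^[0-9]+:[0-9]+:[0-9]+$/ && $2 == "Y" && NF == 6 && $6 + 0 > 0 {
            print "SMART errors on " $1 ": " $6
            bad = 1
        }
        END { exit bad }
    '
}
```

Rows for devices that are not SMART capable (like 0:06:0 above) have no Error Count and are skipped.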

AAC0> task list
Executing: task list

Controller Tasks

TaskId Function  Done%  Container State Specific1 Specific2
------ -------- ------- --------- ----- --------- ---------

No tasks currently running on controller

Look for an absence of running tasks. A bad sign would be a rebuild or verify running when you didn't initiate it.

With the history output, you're looking for any anomalies or events since the last time a verify was run. If you see a drive with lots of problems, you may want to take backups before allowing the verify to run, since it could replicate errors onto the good drive.

After the history output, the script prompts you to press enter to run the verify. If you're happy with the output (mirror is healthy, history looks good), it's safe to proceed. Otherwise, ^C to exit. After you hit enter, it starts the verify and begins tailing the messages log so you can easily see when the verify is complete. At that point, run the commands the script printed to view the history and see the results of the verify. So, putting it all together, after hitting enter to start the verify, you'll see:

---------------------------------------------------------------------------------------------

Adaptec SCSI RAID Controller Command Line Interface
Copyright 1998-2002 Adaptec, Inc. All rights reserved
---------------------------------------------------------------------------------------------


CLI > open aac0
Executing: open "aac0"

AAC0> contai scr 0
Executing: container scrub 0

AAC0> contai scr 1
Executing: container scrub 1

AAC0> exit
Executing: exit

when done run:                                                                       

aaccli
open aac0
dia sh hi
c


Nov  1 10:32:46 mail /kernel: aac0: **Monitor** Container 0 started SCRUB task
Nov  1 10:32:47 mail /kernel: aac0: **Monitor** Container 1 started SCRUB task

If the server has multiple logical drives, their scrubs (verifies) run in parallel. When they are all complete, you should run:

aaccli
open aac0
dia sh hi
c

This shows you the diagnostic history. You're looking for the results of the most recent scrub:

[100]: Mirror Scrub Container:1   ErrorsFound:0
[101]: Clear disk log: sector - 81, driveno 2
[102]: Clear disk log: sector - 81, driveno 3
[103]: Container 1 completed SCRUB task:
[104]: Mirror Scrub Container:0   ErrorsFound:0
[105]: Clear disk log: sector - 80, driveno 1
[106]: Clear disk log: sector - 80, driveno 0
[107]: Container 0 completed SCRUB task:

If you see:

[104]: Mirror Scrub Container:0   ErrorsFound:5

You'll want to rerun the verify on that container until it shows 0, or perhaps replace the drive. You should be able to see from the output which drive had the problem.
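
Since the history buffer holds many past runs, it can help to pull out just the latest result per container. Here's a minimal sketch over saved "dia sh hi" output. The function name scrub_results is made up, and it assumes result lines always contain "Scrub Container:N" and "ErrorsFound:M" as in the samples above.

```shell
# scrub_results: read diagnostic history on stdin and print the most recent
# ErrorsFound value for each container, in order of first appearance.
# Exit status is 1 if any container's latest scrub found errors, else 0.
scrub_results() {
    awk '
        # Lines look like: [104]: Mirror Scrub Container:0   ErrorsFound:5
        /Scrub Container:[0-9]+/ && /ErrorsFound:[0-9]+/ {
            c = $0; sub(/.*Container:/, "", c); sub(/[^0-9].*/, "", c)
            e = $0; sub(/.*ErrorsFound:/, "", e); sub(/[^0-9].*/, "", e)
            if (!(c in last)) order[++n] = c   # remember first-seen order
            last[c] = e                        # later entries overwrite earlier ones
        }
        END {
            bad = 0
            for (i = 1; i <= n; i++) {
                c = order[i]
                print "container " c ": ErrorsFound " last[c]
                if (last[c] + 0 > 0) bad = 1
            }
            exit bad
        }
    '
}
```

This only summarizes the counts. To work out which drive had the problem you still need to read the surrounding history entries yourself.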

See Adaptec RAID CLI Reference for more details on how to use the CLI