System-generated Notifications
"snapshot rotation done on backup1"
Sent daily - expect to receive this!
'''Action to take:''' Confirm that this was received before midnight (or whenever backups start from virts/jails -> backup server). Delete email.
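If you'd rather confirm the arrival time from a shell than a mail client, a grep over the mailbox works; a minimal sketch, assuming the notifications land in a local mbox (the path here is hypothetical):
<pre># hypothetical mbox path - adjust to wherever these notifications are delivered
grep -B10 'snapshot rotation done on backup1' /var/mail/root | grep '^Date:'</pre>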
"RAID controller problem on backup1.johncompanies.com"
<pre>Ignoring all alarms prior to 2012-09-12-12-36-37
unitu0 drive p1 status= DEVICE-ERROR
there was a WARNING event on 2012-09-14 01:59:39
there was a WARNING event on 2012-09-14 02:08:27
there was a WARNING event on 2012-09-14 03:54:47
there was a WARNING event on 2012-09-15 02:38:14
there was a WARNING event on 2012-09-15 02:59:02
there was a WARNING event on 2012-09-15 04:47:08
there was a WARNING event on 2012-09-15 04:47:31
there was a WARNING event on 2012-09-15 10:41:59
there was a WARNING event on 2012-09-15 13:25:23
there was a WARNING event on 2012-09-15 13:25:31
there was a WARNING event on 2012-09-15 13:25:54
there was a WARNING event on 2012-09-15 17:10:50
there was a WARNING event on 2012-09-18 01:17:18
there was a WARNING event on 2012-09-25 01:56:47
there was a WARNING event on 2012-09-29 02:04:14
there was a WARNING event on 2012-09-29 10:58:39
there was a WARNING event on 2012-09-29 10:59:02
there was a WARNING event on 2012-09-29 11:22:44
there was a WARNING event on 2012-09-29 13:50:48
there was a WARNING event on 2012-09-29 13:51:11
there was a WARNING event on 2012-09-29 13:51:30
there was a WARNING event on 2012-10-01 04:47:24
there was a WARNING event on 2012-10-02 02:00:27
there was a WARNING event on 2012-10-02 02:01:56
there was a WARNING event on 2012-10-02 05:02:31
there was a WARNING event on 2012-10-02 05:04:14
there was a WARNING event on 2012-10-03 01:22:12
there was a WARNING event on 2012-10-04 04:29:22
there was a WARNING event on 2012-10-04 05:10:51
there was a WARNING event on 2012-10-06 19:41:18
there was a WARNING event on 2012-10-08 00:32:06
there was a WARNING event on 2012-10-09 03:51:03
to see all status: tw_cli /c0 show all
to see all alarms: tw_cli show alarms
to silence old alarms: 3wraidchk shh</pre>
You get this when the cronjob running on the server notices there has been a new event in the controller logs; the recent events are included in the notification.
'''Action to take:''' Review the logs on the server. See the commands above and review tw_cli_Reference. Optional: clear the warning with <tt>3wraidchk shh</tt> on the server that generated the notice. Delete email.
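For convenience, here are the review and cleanup commands from the notification, run on the server that generated it:
<pre># full controller/unit/drive status
tw_cli /c0 show all
# the alarm log
tw_cli show alarms
# silence the old alarms once reviewed
3wraidchk shh</pre>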
"bwdb2: sendsql.pl error"
<pre>scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost connection)</pre>
'''Action to take:''' Usually none; this is a temporary failure to transfer bandwidth stats from bwdb2 to backup1, where the database that tracks and supplies all b/w stats lives. The script will continue to attempt to send the data over. Only take action if it continues to fail without an obvious reason (e.g. a temporary i2b<->castle outage, or backup1 being down).
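If a file does get stuck after the outage clears, a manual retry is just the same scp the script runs; a sketch, using the filename from the example notification above (substitute whatever is sitting in /usr/home/sql on bwdb2):
<pre>bwdb2# scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/</pre>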
"replication failing"
<pre>replication behind by 0 days, 12 hours, 23 minutes</pre>
This message serves as a warning that the MySQL replication between bwdb and backup1 is no longer in sync. This could be the result of one of two things:
1. replication really isn't happening anymore
2. there is a backlog of netflow files that bwdb is working its way through, which is why the latest data in the DB (stale data is what this notification looks for) is from hours ago
To rule out #2, on bwdb simply look at the queue of flows to be processed:
<pre>bwdb /home/flowbin# ls /usr/home/working
ft-v05.2012-11-05.113000-0800
bwdb /home/flowbin#</pre>
You can see there's one flow there ready to be processed. Had there been a backlog, you'd see lots of files. So this message must mean there's a replication failure. Here's how to check the replication status:
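If you just want a count of the backlog rather than an eyeball of the listing:
<pre>bwdb /home/flowbin# ls /usr/home/working | wc -l
       1</pre>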
On bwdb:
<pre>mysql> SHOW MASTER STATUS;
+-----------------+-----------+--------------+------------------+
| File            | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+-----------+--------------+------------------+
| bwdb-bin.001551 | 156935084 |              |                  |
+-----------------+-----------+--------------+------------------+
1 row in set (0.02 sec)</pre>
On backup1:
<pre>mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001550
        Read_Master_Log_Pos: 505039281
             Relay_Log_File: mysqld-relay-bin.001527
              Relay_Log_Pos: 98
      Relay_Master_Log_File: bwdb-bin.001550
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 505039281
            Relay_Log_Space: 98
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.57 sec)</pre>
Our indicators that something is wrong come from these fields on backup1:
<pre>Read_Master_Log_Pos: 505039281</pre>
This doesn't match what bwdb shows for the log position: <tt>156935084</tt>
<pre>Relay_Master_Log_File: bwdb-bin.001550</pre>
This doesn't match what bwdb shows for the log file: <tt>bwdb-bin.001551</tt>
<pre>Slave_IO_Running: No</pre>
This should say <tt>Yes</tt>.
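To pull just these fields without paging through the full <tt>\G</tt> output, something like this works on backup1 (a sketch, assuming the mysql client logs in via the usual defaults):
<pre>backup1# mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Master_Log_File|Read_Master_Log_Pos|Slave_IO_Running'</pre>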
Here's how this is resolved:
<pre>mysql> stop slave;
mysql> reset slave;
mysql> start slave;
mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Queueing master event to the relay log
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001549
        Read_Master_Log_Pos: 236696828
             Relay_Log_File: mysqld-relay-bin.000003
              Relay_Log_Pos: 133842
      Relay_Master_Log_File: bwdb-bin.001549
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 132576
            Relay_Log_Space: 238358016
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 343413</pre>
I'm a little perplexed why it reverted to log file <tt>bwdb-bin.001549</tt>, but we can see that <tt>Slave_IO_Running</tt> and <tt>Slave_SQL_Running</tt> are both Yes, and if you re-run <tt>show slave status</tt> you'll see that <tt>Read_Master_Log_Pos</tt> is incrementing.
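To confirm it is in fact catching up, poll <tt>Seconds_Behind_Master</tt> for a minute or two; the number should be counting down toward 0 (a sketch, again assuming mysql client defaults on backup1):
<pre>backup1# while true; do mysql -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master; sleep 10; done</pre>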