System-generated Notifications

From JCWiki
Jump to navigation Jump to search

"snapshot rotation done on backup1"

Sent daily - expect to receive this!

Action to take: Confirm that this was received before midnight (or whenever backups start from virts/jails -> backup server). Delete email.

"RAID controller problem on backup1.johncompanies.com"

Ignoring all alarms prior to 2012-09-12-12-36-37
unitu0 drive p1 status= DEVICE-ERROR
there was a WARNING event on 2012-09-14 01:59:39
there was a WARNING event on 2012-09-14 02:08:27
there was a WARNING event on 2012-09-14 03:54:47
there was a WARNING event on 2012-09-15 02:38:14
there was a WARNING event on 2012-09-15 02:59:02
there was a WARNING event on 2012-09-15 04:47:08
there was a WARNING event on 2012-09-15 04:47:31
there was a WARNING event on 2012-09-15 10:41:59
there was a WARNING event on 2012-09-15 13:25:23
there was a WARNING event on 2012-09-15 13:25:31
there was a WARNING event on 2012-09-15 13:25:54
there was a WARNING event on 2012-09-15 17:10:50
there was a WARNING event on 2012-09-18 01:17:18
there was a WARNING event on 2012-09-25 01:56:47
there was a WARNING event on 2012-09-29 02:04:14
there was a WARNING event on 2012-09-29 10:58:39
there was a WARNING event on 2012-09-29 10:59:02
there was a WARNING event on 2012-09-29 11:22:44
there was a WARNING event on 2012-09-29 13:50:48
there was a WARNING event on 2012-09-29 13:51:11
there was a WARNING event on 2012-09-29 13:51:30
there was a WARNING event on 2012-10-01 04:47:24
there was a WARNING event on 2012-10-02 02:00:27
there was a WARNING event on 2012-10-02 02:01:56
there was a WARNING event on 2012-10-02 05:02:31
there was a WARNING event on 2012-10-02 05:04:14
there was a WARNING event on 2012-10-03 01:22:12
there was a WARNING event on 2012-10-04 04:29:22
there was a WARNING event on 2012-10-04 05:10:51
there was a WARNING event on 2012-10-06 19:41:18
there was a WARNING event on 2012-10-08 00:32:06
there was a WARNING event on 2012-10-09 03:51:03
to see all status: tw_cli /c0 show all
to see all alarms: tw_cli show alarms
to silence old alarms: 3wraidchk shh

You get this when the cronjob running on the server notices there's been a new event in the logs, and those recent events are include.

Action to take: Review the logs on the server. See commands above and review tw_cli_Reference. Optional: clear the warning with 3wraidchk shh on the server that generated the notice. Delete email.

"bwdb2: sendsql.pl error"

scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost
connection)

Action to take: Usually none, this is a temporary failure to transfer bandwidth stats from bwdb2 to backup1 where the database lives that tracks and supplies all b/w stats. The script will continue to attempt to send data over. Only take action if it continues to fail without obvious reason (i.e. temporary outage i2b<->castle or backup1 down)

"replication failing"

replication behind by 0 days, 12 hours, 23 minutes

This message serves as a warning that the mysql replication between bwdb and backup1 is not in sync any longer. This could be the result of 1 of 2 things:

1. replication really isn't happening anymore

2. there is a backlog of netflow(s) that bwdb is working it's way through. That's why the latest data in the DB (which is what this notification looks for- stale data) is from hours ago.

To rule out #2, on bwdb simply look at the queue of flows to be processed:

bwdb /home/flowbin# ls /usr/home/working
ft-v05.2012-11-05.113000-0800
bwdb /home/flowbin#

You see there's 1 flow there ready to processes. Had there been a backlog you'd see lots of files there. So this message must mean there's a replication failure. Here's how to look at the replication status:

On bwdb:

mysql> SHOW MASTER STATUS;
+-----------------+-----------+--------------+------------------+
| File            | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+-----------+--------------+------------------+
| bwdb-bin.001551 | 156935084 |              |                  |
+-----------------+-----------+--------------+------------------+
1 row in set (0.02 sec)

On backup1:

mysql> show slave status\G

*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001550
        Read_Master_Log_Pos: 505039281
             Relay_Log_File: mysqld-relay-bin.001527
              Relay_Log_Pos: 98
      Relay_Master_Log_File: bwdb-bin.001550
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 505039281
            Relay_Log_Space: 98
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.57 sec)

Our indicators that something is wrong come from looking at these fields from backup1:

Read_Master_Log_Pos: 505039281

This doesn't match up with what bwdb shows for log position: 156935084

Relay_Master_Log_File: bwdb-bin.001550 

This doesn't match up with what bwdb shows for log file: bwdb-bin.001551

Slave_IO_Running: No

This should say Yes

Here's how this is resolved:

 mysql> stop slave;
 mysql> reset slave;
 mysql> start slave;
 mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Queueing master event to the relay log
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001549
        Read_Master_Log_Pos: 236696828
             Relay_Log_File: mysqld-relay-bin.000003
              Relay_Log_Pos: 133842
      Relay_Master_Log_File: bwdb-bin.001549
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 132576
            Relay_Log_Space: 238358016
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 343413

I'm a little perplexed why it reverted to log file bwdb-bin.001549, but we see that Slave_IO_Running and Slave_SQL_Running are Yes and if you re-run the show slave status command you will see that the Read_Master_Log_Pos is incrementing. However after running through all of the log file bwdb-bin.001550, it eventually fails:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001550
        Read_Master_Log_Pos: 504660170
             Relay_Log_File: mysqld-relay-bin.000006
              Relay_Log_Pos: 508191412
      Relay_Master_Log_File: bwdb-bin.001550
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 1064
                 Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: ''. Query: ''
               Skip_Counter: 0
        Exec_Master_Log_Pos: 504659938
            Relay_Log_Space: 508191646
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

Probably cause it got to a corrupted part of the log file. After looking at http://www.zmonkey.org/blog/node/157 it looks like i may need to skip over a bad part in the log and go to the next entry for the next log position. However, after performing this on bwdb:

bwdb /usr/home/database# mysqlbinlog bwdb-bin.001550 -j 504659938
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 504659938
#121104 20:04:18 server id 1  log_pos 504659938         Query   thread_id=15838 exec_time=0 error_code=0
SET TIMESTAMP=1352088258;
;
ERROR: Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
Could not read entry at offset 504660170:Error in log format or read error
bwdb /usr/home/database#

Which is maybe an indication that this is simply the end of the log file (there's no more output after it), so if we look at the next logfile:

bwdb /usr/home/database# mysqlbinlog bwdb-bin.001551|more
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 4
#121104 23:04:07 server id 1  log_pos 4         Start: binlog v 3, server v 4.1.10a-log creat
ed 121104 23:04:07 at startup
# at 79
#121104 23:18:05 server id 1  log_pos 79        Query   thread_id=2     exec_time=0     error
_code=0
use traffic;
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239101-20121104-1', '2012-11-04 19:45:00', '
69.55.239.101', 1, 344, 7) ON DUPLICATE KEY UPDATE octets=octets+344, packets=packets+7;
# at 297
#121104 23:18:05 server id 1  log_pos 297       Query   thread_id=2     exec_time=0     error
_code=0
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239102-20121104-1', '2012-11-04 19:45:00', '
69.55.239.102', 1, 84, 2) ON DUPLICATE KEY UPDATE octets=octets+84, packets=packets+2;

So it looks like this is what we want, and the time stamp correlates with when the server came back online (from a crash in this instance). So we reset the slave to pull from the other log:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Queueing master event to the relay log
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001551
        Read_Master_Log_Pos: 45486292
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 140
      Relay_Master_Log_File: bwdb-bin.001551
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 4
            Relay_Log_Space: 45805720
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 67238
1 row in set (0.22 sec)

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001551
        Read_Master_Log_Pos: 107339149
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 464584
      Relay_Master_Log_File: bwdb-bin.001551
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 460960
            Relay_Log_Space: 108092633
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 55577
1 row in set (0.00 sec)

And if we continue to show status we see that the position is moving and the Seconds_Behind_Master is dropping. So we should be good.

After a bit we got:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001551
        Read_Master_Log_Pos: 196122754
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 1569496
      Relay_Master_Log_File: bwdb-bin.001551
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error: Query caused different errors on master and slave. Error on master: 'Incorrect key file for table '%-.200s'; try to repair it' (1034), Error on slave: 'no error' (0). Default database: 'traffic'. Query: 'insert into portTotals_69_55_225 values ('2012-11-04 19:45:00', '69.55.225.225', 1, 6, 99999, 100, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 1, 0, 12414, 240.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 53, 15156829.5, 64580.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 1024, 387, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 361, 213, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 833, 15120, 60),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 395, 10534, 46),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 559, 31968, 111),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 53, 1541, 24),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 80, 80, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 81, 40, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 99999, 120, 3),('2012-11-04 19:45:00', '
               Skip_Counter: 0
        Exec_Master_Log_Pos: 1558174
            Relay_Log_Space: 197497274
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

So on bwdb we did:

mysql> repair table portTotals_69_55_225;

Then on backup1: