System-generated Notifications
"snapshot rotation done on backup1"
Sent daily - expect to receive this!
Action to take: Confirm that this was received before midnight (or whenever backups start from virts/jails -> backup server). Delete email.
"RAID controller problem on backup1.johncompanies.com"
Ignoring all alarms prior to 2012-09-12-12-36-37
unit u0 drive p1 status = DEVICE-ERROR
there was a WARNING event on 2012-09-14 01:59:39
there was a WARNING event on 2012-09-14 02:08:27
there was a WARNING event on 2012-09-14 03:54:47
there was a WARNING event on 2012-09-15 02:38:14
there was a WARNING event on 2012-09-15 02:59:02
there was a WARNING event on 2012-09-15 04:47:08
there was a WARNING event on 2012-09-15 04:47:31
there was a WARNING event on 2012-09-15 10:41:59
there was a WARNING event on 2012-09-15 13:25:23
there was a WARNING event on 2012-09-15 13:25:31
there was a WARNING event on 2012-09-15 13:25:54
there was a WARNING event on 2012-09-15 17:10:50
there was a WARNING event on 2012-09-18 01:17:18
there was a WARNING event on 2012-09-25 01:56:47
there was a WARNING event on 2012-09-29 02:04:14
there was a WARNING event on 2012-09-29 10:58:39
there was a WARNING event on 2012-09-29 10:59:02
there was a WARNING event on 2012-09-29 11:22:44
there was a WARNING event on 2012-09-29 13:50:48
there was a WARNING event on 2012-09-29 13:51:11
there was a WARNING event on 2012-09-29 13:51:30
there was a WARNING event on 2012-10-01 04:47:24
there was a WARNING event on 2012-10-02 02:00:27
there was a WARNING event on 2012-10-02 02:01:56
there was a WARNING event on 2012-10-02 05:02:31
there was a WARNING event on 2012-10-02 05:04:14
there was a WARNING event on 2012-10-03 01:22:12
there was a WARNING event on 2012-10-04 04:29:22
there was a WARNING event on 2012-10-04 05:10:51
there was a WARNING event on 2012-10-06 19:41:18
there was a WARNING event on 2012-10-08 00:32:06
there was a WARNING event on 2012-10-09 03:51:03

to see all status:     tw_cli /c0 show all
to see all alarms:     tw_cli show alarms
to silence old alarms: 3wraidchk shh
You get this when the cronjob running on the server notices there's been a new event in the logs; the recent events are included in the message.
Action to take: Review the logs on the server. See commands above and review tw_cli_Reference. Optional: clear the warning with 3wraidchk shh on the server that generated the notice. Delete email.
"bwdb2: sendsql.pl error"
scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost connection)
Action to take: Usually none; this is a temporary failure to transfer bandwidth stats from bwdb2 to backup1, where the database that tracks and supplies all b/w stats lives. The script will continue to attempt to send the data over. Only take action if it continues to fail without an obvious reason (e.g. temporary outage i2b<->castle, or backup1 down).
"replication failing"
replication behind by 0 days, 12 hours, 23 minutes
This message serves as a warning that the MySQL replication between bwdb and backup1 is no longer in sync. This could be the result of one of two things:
1. replication really isn't happening anymore
2. there is a backlog of netflows that bwdb is working its way through. That's why the latest data in the DB (which is what this notification looks for: stale data) is from hours ago.
To rule out #2, on bwdb simply look at the queue of flows to be processed:
bwdb /home/flowbin# ls /usr/home/working
ft-v05.2012-11-05.113000-0800
bwdb /home/flowbin#
You can see there's one flow there ready to be processed. Had there been a backlog, you'd see lots of files there. So this message must mean there's a replication failure. Here's how to look at the replication status:
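The backlog check above can be sketched as a small helper. This is an illustration only: flow_backlog is a hypothetical function, the queue directory is the one from the session above, and the threshold of 5 queued files is an arbitrary assumption.

```shell
# Hypothetical helper: decide whether a netflow backlog exists by counting
# files queued in the working directory (the threshold of 5 is an assumption).
flow_backlog() {
  dir="$1"
  n=$(ls "$dir" 2>/dev/null | wc -l | tr -d ' ')
  if [ "$n" -gt 5 ]; then
    echo "backlog: $n flows queued"
  else
    echo "no backlog: $n flows queued"
  fi
}

# On bwdb you would run:
#   flow_backlog /usr/home/working
```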
On bwdb:
mysql> SHOW MASTER STATUS;
+-----------------+-----------+--------------+------------------+
| File            | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+-----------+--------------+------------------+
| bwdb-bin.001551 | 156935084 |              |                  |
+-----------------+-----------+--------------+------------------+
1 row in set (0.02 sec)
On backup1:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001550
Read_Master_Log_Pos: 505039281
Relay_Log_File: mysqld-relay-bin.001527
Relay_Log_Pos: 98
Relay_Master_Log_File: bwdb-bin.001550
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 505039281
Relay_Log_Space: 98
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.57 sec)
Our indicators that something is wrong come from looking at these fields from backup1:
Read_Master_Log_Pos: 505039281
This doesn't match up with what bwdb shows for log position: 156935084
Relay_Master_Log_File: bwdb-bin.001550
This doesn't match up with what bwdb shows for log file: bwdb-bin.001551
Slave_IO_Running: No
This should say Yes
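Those three checks can be expressed as a quick comparison. This is a sketch: repl_check is a hypothetical helper, and the values are copied in by hand from SHOW MASTER STATUS on bwdb and SHOW SLAVE STATUS on backup1. Note that in a healthy setup the slave's position will trail the master's slightly while it catches up; a large, static gap combined with Slave_IO_Running: No is the real red flag.

```shell
# Sketch: compare master and slave replication coordinates. Arguments:
#   $1 master File                    $2 master Position
#   $3 slave Relay_Master_Log_File    $4 slave Read_Master_Log_Pos
#   $5 slave Slave_IO_Running
repl_check() {
  [ "$1" = "$3" ] || echo "log file mismatch: master=$1 slave=$3"
  [ "$2" = "$4" ] || echo "log position mismatch: master=$2 slave=$4"
  [ "$5" = "Yes" ] || echo "Slave_IO_Running is $5 (should be Yes)"
  return 0
}

# With the values from the outputs above, this reports all three problems:
repl_check bwdb-bin.001551 156935084 bwdb-bin.001550 505039281 No
```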
Here's how this is resolved:
mysql> stop slave;
mysql> reset slave;
mysql> start slave;
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001549
Read_Master_Log_Pos: 236696828
Relay_Log_File: mysqld-relay-bin.000003
Relay_Log_Pos: 133842
Relay_Master_Log_File: bwdb-bin.001549
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 132576
Relay_Log_Space: 238358016
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 343413
I'm a little perplexed why it reverted to log file bwdb-bin.001549, but we can see that Slave_IO_Running and Slave_SQL_Running are both Yes, and if you re-run the show slave status command you will see that Read_Master_Log_Pos is incrementing. However, after running through all of log file bwdb-bin.001550, it eventually fails:
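To confirm the position really is incrementing, sample Read_Master_Log_Pos twice and compare. compare_pos is a hypothetical helper, and the mysql/awk sampling line is a sketch that assumes client access on backup1; the second sample value below is illustrative.

```shell
# Hypothetical helper: given two successive Read_Master_Log_Pos samples,
# report whether replication is advancing.
compare_pos() {
  if [ "$2" -gt "$1" ]; then echo advancing; else echo stalled; fi
}

# On backup1, each sample could be taken with something like:
#   mysql -e 'SHOW SLAVE STATUS\G' | awk '/Read_Master_Log_Pos/ {print $2}'
compare_pos 236696828 236900000   # second sample is larger -> advancing
```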
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001550
Read_Master_Log_Pos: 504660170
Relay_Log_File: mysqld-relay-bin.000006
Relay_Log_Pos: 508191412
Relay_Master_Log_File: bwdb-bin.001550
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 1064
Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: ''. Query: ''
Skip_Counter: 0
Exec_Master_Log_Pos: 504659938
Relay_Log_Space: 508191646
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
Probably because it got to a corrupted part of the log file. After looking at http://www.zmonkey.org/blog/node/157 it looks like I may need to skip over the bad part of the log and resume at the next entry's log position. However, after performing this on bwdb:
bwdb /usr/home/database# mysqlbinlog bwdb-bin.001550 -j 504659938
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 504659938
#121104 20:04:18 server id 1  log_pos 504659938  Query  thread_id=15838  exec_time=0  error_code=0
SET TIMESTAMP=1352088258;
;
ERROR: Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
Could not read entry at offset 504660170:Error in log format or read error
bwdb /usr/home/database#
This is maybe an indication that we've simply reached the end of the log file (there's no more output after it), so we look at the next log file:
bwdb /usr/home/database# mysqlbinlog bwdb-bin.001551 | more
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 4
#121104 23:04:07 server id 1  log_pos 4  Start: binlog v 3, server v 4.1.10a-log created 121104 23:04:07 at startup
# at 79
#121104 23:18:05 server id 1  log_pos 79  Query  thread_id=2  exec_time=0  error_code=0
use traffic;
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239101-20121104-1', '2012-11-04 19:45:00', '69.55.239.101', 1, 344, 7) ON DUPLICATE KEY UPDATE octets=octets+344, packets=packets+7;
# at 297
#121104 23:18:05 server id 1  log_pos 297  Query  thread_id=2  exec_time=0  error_code=0
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239102-20121104-1', '2012-11-04 19:45:00', '69.55.239.102', 1, 84, 2) ON DUPLICATE KEY UPDATE octets=octets+84, packets=packets+2;
So it looks like this is what we want, and the time stamp correlates with when the server came back online (from a crash in this instance). So we reset the slave to pull from the other log:
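The exact statements used to repoint the slave weren't captured here. On backup1 the standard sequence would be along these lines (an assumed reconstruction: the file name and starting position 4 come from the mysqlbinlog output above, and match the Master_Log_File and Exec_Master_Log_Pos in the status dump that follows):

mysql> stop slave;
mysql> change master to master_log_file='bwdb-bin.001551', master_log_pos=4;
mysql> start slave;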
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001551
Read_Master_Log_Pos: 45486292
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 140
Relay_Master_Log_File: bwdb-bin.001551
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 4
Relay_Log_Space: 45805720
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 67238
1 row in set (0.22 sec)

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001551
Read_Master_Log_Pos: 107339149
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 464584
Relay_Master_Log_File: bwdb-bin.001551
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 460960
Relay_Log_Space: 108092633
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 55577
1 row in set (0.00 sec)
And if we continue to check the status, we see that the position is moving and Seconds_Behind_Master is dropping. So we should be good.
After a bit we got:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001551
Read_Master_Log_Pos: 196122754
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 1569496
Relay_Master_Log_File: bwdb-bin.001551
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error: Query caused different errors on master and slave. Error on master: 'Incorrect key file for table '%-.200s'; try to repair it' (1034), Error on slave: 'no error' (0). Default database: 'traffic'. Query: 'insert into portTotals_69_55_225 values ('2012-11-04 19:45:00', '69.55.225.225', 1, 6, 99999, 100, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 1, 0, 12414, 240.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 53, 15156829.5, 64580.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 1024, 387, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 361, 213, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 833, 15120, 60),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 395, 10534, 46),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 559, 31968, 111),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 53, 1541, 24),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 80, 80, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 81, 40, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 99999, 120, 3),('2012-11-04 19:45:00', '
Skip_Counter: 0
Exec_Master_Log_Pos: 1558174
Relay_Log_Space: 197497274
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
So on bwdb we did:
mysql> repair table portTotals_69_55_225;
Then on backup1: