System-generated Notifications
"snapshot rotation done on backup1"
Sent daily - expect to receive this!
'''Action to take:''' Confirm that this was received before midnight (or whenever backups start from virts/jails -> backup server). Delete email.
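If you'd rather confirm the arrival time from a shell than a mail client, a grep over the mailbox works; a minimal sketch, assuming the notifications land in a local mbox (the path here is hypothetical):
<pre># hypothetical mbox path - adjust to wherever these notifications are delivered
grep -B10 'snapshot rotation done on backup1' /var/mail/root | grep '^Date:'</pre>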
"RAID controller problem on backup1.johncompanies.com"
<pre>Ignoring all alarms prior to 2012-09-12-12-36-37
unitu0 drive p1 status= DEVICE-ERROR
there was a WARNING event on 2012-09-14 01:59:39
there was a WARNING event on 2012-09-14 02:08:27
there was a WARNING event on 2012-09-14 03:54:47
there was a WARNING event on 2012-09-15 02:38:14
there was a WARNING event on 2012-09-15 02:59:02
there was a WARNING event on 2012-09-15 04:47:08
there was a WARNING event on 2012-09-15 04:47:31
there was a WARNING event on 2012-09-15 10:41:59
there was a WARNING event on 2012-09-15 13:25:23
there was a WARNING event on 2012-09-15 13:25:31
there was a WARNING event on 2012-09-15 13:25:54
there was a WARNING event on 2012-09-15 17:10:50
there was a WARNING event on 2012-09-18 01:17:18
there was a WARNING event on 2012-09-25 01:56:47
there was a WARNING event on 2012-09-29 02:04:14
there was a WARNING event on 2012-09-29 10:58:39
there was a WARNING event on 2012-09-29 10:59:02
there was a WARNING event on 2012-09-29 11:22:44
there was a WARNING event on 2012-09-29 13:50:48
there was a WARNING event on 2012-09-29 13:51:11
there was a WARNING event on 2012-09-29 13:51:30
there was a WARNING event on 2012-10-01 04:47:24
there was a WARNING event on 2012-10-02 02:00:27
there was a WARNING event on 2012-10-02 02:01:56
there was a WARNING event on 2012-10-02 05:02:31
there was a WARNING event on 2012-10-02 05:04:14
there was a WARNING event on 2012-10-03 01:22:12
there was a WARNING event on 2012-10-04 04:29:22
there was a WARNING event on 2012-10-04 05:10:51
there was a WARNING event on 2012-10-06 19:41:18
there was a WARNING event on 2012-10-08 00:32:06
there was a WARNING event on 2012-10-09 03:51:03
to see all status: tw_cli /c0 show all
to see all alarms: tw_cli show alarms
to silence old alarms: 3wraidchk shh</pre>
You get this when the cronjob running on the server notices there has been a new event in the controller logs; the recent events are included in the notification.
'''Action to take:''' Review the logs on the server. See the commands above and review tw_cli_Reference. Optional: clear the warning with <tt>3wraidchk shh</tt> on the server that generated the notice. Delete email.
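For convenience, here are the review and cleanup commands from the notification, run on the server that generated it:
<pre># full controller/unit/drive status
tw_cli /c0 show all
# the alarm log
tw_cli show alarms
# silence the old alarms once reviewed
3wraidchk shh</pre>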
"bwdb2: sendsql.pl error"
<pre>scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost connection)</pre>
'''Action to take:''' Usually none; this is a temporary failure to transfer bandwidth stats from bwdb2 to backup1, where the database that tracks and supplies all b/w stats lives. The script will continue to attempt to send the data over. Only take action if it continues to fail without an obvious reason (e.g. a temporary i2b<->castle outage, or backup1 being down).
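If a file does get stuck after the outage clears, a manual retry is just the same scp the script runs; a sketch, using the filename from the example notification above (substitute whatever is sitting in /usr/home/sql on bwdb2):
<pre>bwdb2# scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/</pre>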
"replication failing"
<pre>replication behind by 0 days, 12 hours, 23 minutes</pre>
This message serves as a warning that the MySQL replication between bwdb and backup1 is no longer in sync. This could be the result of one of two things:
1. replication really isn't happening anymore
2. there is a backlog of netflow files that bwdb is working its way through, which is why the latest data in the DB (stale data is what this notification looks for) is from hours ago
To rule out #2, on bwdb simply look at the queue of flows to be processed:
<pre>bwdb /home/flowbin# ls /usr/home/working
ft-v05.2012-11-05.113000-0800
bwdb /home/flowbin#</pre>
You can see there's one flow there ready to be processed. Had there been a backlog, you'd see lots of files. So this message must mean there's a replication failure. Here's how to check the replication status:
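If you just want a count of the backlog rather than an eyeball of the listing:
<pre>bwdb /home/flowbin# ls /usr/home/working | wc -l
       1</pre>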
On bwdb:
<pre>mysql> SHOW MASTER STATUS;
+-----------------+-----------+--------------+------------------+
| File            | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+-----------+--------------+------------------+
| bwdb-bin.001551 | 156935084 |              |                  |
+-----------------+-----------+--------------+------------------+
1 row in set (0.02 sec)</pre>
On backup1:
<pre>mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001550
        Read_Master_Log_Pos: 505039281
             Relay_Log_File: mysqld-relay-bin.001527
              Relay_Log_Pos: 98
      Relay_Master_Log_File: bwdb-bin.001550
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 505039281
            Relay_Log_Space: 98
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.57 sec)</pre>
Our indicators that something is wrong come from these fields on backup1:
<pre>Read_Master_Log_Pos: 505039281</pre>
This doesn't match what bwdb shows for the log position: <tt>156935084</tt>
<pre>Relay_Master_Log_File: bwdb-bin.001550</pre>
This doesn't match what bwdb shows for the log file: <tt>bwdb-bin.001551</tt>
<pre>Slave_IO_Running: No</pre>
This should say <tt>Yes</tt>.
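To pull just these fields without paging through the full <tt>\G</tt> output, something like this works on backup1 (a sketch, assuming the mysql client logs in via the usual defaults):
<pre>backup1# mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Master_Log_File|Read_Master_Log_Pos|Slave_IO_Running'</pre>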
Here's how this is resolved:
<pre>mysql> stop slave;
mysql> reset slave;
mysql> start slave;
mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Queueing master event to the relay log
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001549
        Read_Master_Log_Pos: 236696828
             Relay_Log_File: mysqld-relay-bin.000003
              Relay_Log_Pos: 133842
      Relay_Master_Log_File: bwdb-bin.001549
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 132576
            Relay_Log_Space: 238358016
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 343413</pre>
I'm a little perplexed why it reverted to log file <tt>bwdb-bin.001549</tt>, but we can see that <tt>Slave_IO_Running</tt> and <tt>Slave_SQL_Running</tt> are both Yes, and if you re-run <tt>show slave status</tt> you'll see that <tt>Read_Master_Log_Pos</tt> is incrementing.
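To confirm it is in fact catching up, poll <tt>Seconds_Behind_Master</tt> for a minute or two; the number should be counting down toward 0 (a sketch, again assuming mysql client defaults on backup1):
<pre>backup1# while true; do mysql -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master; sleep 10; done</pre>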