System-generated Notifications: Difference between revisions
Line 180: | Line 180: | ||
Seconds_Behind_Master: 343413</pre> | Seconds_Behind_Master: 343413</pre> | ||
I'm a little perplexed why it reverted to log file <tt>bwdb-bin.001549</tt>, but we see that <tt>Slave_IO_Running</tt> and <tt>Slave_SQL_Running</tt> are Yes and if you re-run the show slave status command you will see that the <tt>Read_Master_Log_Pos</tt> is incrementing. However it eventually fails: | I'm a little perplexed why it reverted to log file <tt>bwdb-bin.001549</tt>, but we see that <tt>Slave_IO_Running</tt> and <tt>Slave_SQL_Running</tt> are Yes and if you re-run the show slave status command you will see that the <tt>Read_Master_Log_Pos</tt> is incrementing. However after running through all of the log file <tt>bwdb-bin.001550</tt>, it eventually fails: | ||
<pre>mysql> show slave status\G | <pre> | ||
mysql> show slave status\G | |||
*************************** 1. row *************************** | *************************** 1. row *************************** | ||
Slave_IO_State: | Slave_IO_State: | ||
Line 191: | Line 192: | ||
Master_Log_File: bwdb-bin.001550 | Master_Log_File: bwdb-bin.001550 | ||
Read_Master_Log_Pos: 504660170 | Read_Master_Log_Pos: 504660170 | ||
Relay_Log_File: mysqld-relay-bin. | Relay_Log_File: mysqld-relay-bin.000006 | ||
Relay_Log_Pos: | Relay_Log_Pos: 508191412 | ||
Relay_Master_Log_File: bwdb-bin. | Relay_Master_Log_File: bwdb-bin.001550 | ||
Slave_IO_Running: No | Slave_IO_Running: No | ||
Slave_SQL_Running: | Slave_SQL_Running: No | ||
Replicate_Do_DB: | Replicate_Do_DB: | ||
Replicate_Ignore_DB: | Replicate_Ignore_DB: | ||
Line 202: | Line 203: | ||
Replicate_Wild_Do_Table: traffic.% | Replicate_Wild_Do_Table: traffic.% | ||
Replicate_Wild_Ignore_Table: | Replicate_Wild_Ignore_Table: | ||
Last_Errno: | Last_Errno: 1064 | ||
Last_Error: | Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: ''. Query: '' | ||
Skip_Counter: 0 | Skip_Counter: 0 | ||
Exec_Master_Log_Pos: | Exec_Master_Log_Pos: 504659938 | ||
Relay_Log_Space: | Relay_Log_Space: 508191646 | ||
Until_Condition: None | Until_Condition: None | ||
Until_Log_File: | Until_Log_File: | ||
Line 218: | Line 219: | ||
Seconds_Behind_Master: NULL | Seconds_Behind_Master: NULL | ||
1 row in set (0.00 sec) | 1 row in set (0.00 sec) | ||
</pre> | </pre> | ||
Probably cause it got to a corrupted part of the log file. After looking at [[http://www.zmonkey.org/blog/node/157|http://www.zmonkey.org/blog/node/157]] it looks like i may need to skip over a bad part in the log and go to the next entry for the next log position. However, after performing this on bwdb: |
Revision as of 15:32, 5 November 2012
"snapshot rotation done on backup1"
Sent daily - expect to receive this!
Action to take: Confirm that this was received before midnight (or whenever backups start from virts/jails -> backup server). Delete email.
"RAID controller problem on backup1.johncompanies.com"
Ignoring all alarms prior to 2012-09-12-12-36-37 unitu0 drive p1 status= DEVICE-ERROR there was a WARNING event on 2012-09-14 01:59:39 there was a WARNING event on 2012-09-14 02:08:27 there was a WARNING event on 2012-09-14 03:54:47 there was a WARNING event on 2012-09-15 02:38:14 there was a WARNING event on 2012-09-15 02:59:02 there was a WARNING event on 2012-09-15 04:47:08 there was a WARNING event on 2012-09-15 04:47:31 there was a WARNING event on 2012-09-15 10:41:59 there was a WARNING event on 2012-09-15 13:25:23 there was a WARNING event on 2012-09-15 13:25:31 there was a WARNING event on 2012-09-15 13:25:54 there was a WARNING event on 2012-09-15 17:10:50 there was a WARNING event on 2012-09-18 01:17:18 there was a WARNING event on 2012-09-25 01:56:47 there was a WARNING event on 2012-09-29 02:04:14 there was a WARNING event on 2012-09-29 10:58:39 there was a WARNING event on 2012-09-29 10:59:02 there was a WARNING event on 2012-09-29 11:22:44 there was a WARNING event on 2012-09-29 13:50:48 there was a WARNING event on 2012-09-29 13:51:11 there was a WARNING event on 2012-09-29 13:51:30 there was a WARNING event on 2012-10-01 04:47:24 there was a WARNING event on 2012-10-02 02:00:27 there was a WARNING event on 2012-10-02 02:01:56 there was a WARNING event on 2012-10-02 05:02:31 there was a WARNING event on 2012-10-02 05:04:14 there was a WARNING event on 2012-10-03 01:22:12 there was a WARNING event on 2012-10-04 04:29:22 there was a WARNING event on 2012-10-04 05:10:51 there was a WARNING event on 2012-10-06 19:41:18 there was a WARNING event on 2012-10-08 00:32:06 there was a WARNING event on 2012-10-09 03:51:03 to see all status: tw_cli /c0 show all to see all alarms: tw_cli show alarms to silence old alarms: 3wraidchk shh
You get this when the cronjob running on the server notices there's been a new event in the logs, and those recent events are include.
Action to take: Review the logs on the server. See commands above and review tw_cli_Reference. Optional: clear the warning with 3wraidchk shh on the server that generated the notice. Delete email.
"bwdb2: sendsql.pl error"
scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost connection)
Action to take: Usually none, this is a temporary failure to transfer bandwidth stats from bwdb2 to backup1 where the database lives that tracks and supplies all b/w stats. The script will continue to attempt to send data over. Only take action if it continues to fail without obvious reason (i.e. temporary outage i2b<->castle or backup1 down)
"replication failing"
replication behind by 0 days, 12 hours, 23 minutes
This message serves as a warning that the mysql replication between bwdb and backup1 is not in sync any longer. This could be the result of 1 of 2 things:
1. replication really isn't happening anymore
2. there is a backlog of netflow(s) that bwdb is working it's way through. That's why the latest data in the DB (which is what this notification looks for- stale data) is from hours ago.
To rule out #2, on bwdb simply look at the queue of flows to be processed:
bwdb /home/flowbin# ls /usr/home/working ft-v05.2012-11-05.113000-0800 bwdb /home/flowbin#
You see there's 1 flow there ready to processes. Had there been a backlog you'd see lots of files there. So this message must mean there's a replication failure. Here's how to look at the replication status:
On bwdb:
mysql> SHOW MASTER STATUS; +-----------------+-----------+--------------+------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | +-----------------+-----------+--------------+------------------+ | bwdb-bin.001551 | 156935084 | | | +-----------------+-----------+--------------+------------------+ 1 row in set (0.02 sec)
On backup1:
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Master_Host: 10.1.4.203 Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: bwdb-bin.001550 Read_Master_Log_Pos: 505039281 Relay_Log_File: mysqld-relay-bin.001527 Relay_Log_Pos: 98 Relay_Master_Log_File: bwdb-bin.001550 Slave_IO_Running: No Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: traffic.% Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 505039281 Relay_Log_Space: 98 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.57 sec)
Our indicators that something is wrong come from looking at these fields from backup1:
Read_Master_Log_Pos: 505039281
This doesn't match up with what bwdb shows for log position: 156935084
Relay_Master_Log_File: bwdb-bin.001550
This doesn't match up with what bwdb shows for log file: bwdb-bin.001551
Slave_IO_Running: No
This should say Yes
Here's how this is resolved:
mysql> stop slave; mysql> reset slave; mysql> start slave; mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Queueing master event to the relay log Master_Host: 10.1.4.203 Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: bwdb-bin.001549 Read_Master_Log_Pos: 236696828 Relay_Log_File: mysqld-relay-bin.000003 Relay_Log_Pos: 133842 Relay_Master_Log_File: bwdb-bin.001549 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: traffic.% Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 132576 Relay_Log_Space: 238358016 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 343413
I'm a little perplexed why it reverted to log file bwdb-bin.001549, but we see that Slave_IO_Running and Slave_SQL_Running are Yes and if you re-run the show slave status command you will see that the Read_Master_Log_Pos is incrementing. However after running through all of the log file bwdb-bin.001550, it eventually fails:
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Master_Host: 10.1.4.203 Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: bwdb-bin.001550 Read_Master_Log_Pos: 504660170 Relay_Log_File: mysqld-relay-bin.000006 Relay_Log_Pos: 508191412 Relay_Master_Log_File: bwdb-bin.001550 Slave_IO_Running: No Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: traffic.% Replicate_Wild_Ignore_Table: Last_Errno: 1064 Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: ''. Query: '' Skip_Counter: 0 Exec_Master_Log_Pos: 504659938 Relay_Log_Space: 508191646 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec)
Probably cause it got to a corrupted part of the log file. After looking at [[1]] it looks like i may need to skip over a bad part in the log and go to the next entry for the next log position. However, after performing this on bwdb: