System-generated Notifications
Created page with "= "RAID controller problem on backup1.johncompanies.com" = <pre> Ignoring all alarms prior to 2012-09-12-12-36-37 unitu0 drive p1 status= DEVICE-ERROR there was a WARNING eve..." |
|||
(31 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
= Payment-related emails =
Also see [[Payment_System#Payment-related_emails|Payment-related emails]]
= "snapshot rotation done on backup1" =
Sent daily - expect to receive this!
'''Action to take:''' Confirm that this was received before midnight (or whenever backups start from virts/jails -> backup server). Delete email.
= "!BB - 9638010! jail1.johncompanies.com.conn - 600082197184218" = | |||
These are generated by [[BigBrother]] and the content will depend on the server and problem | |||
'''Action to take:''' You will want to look into the cause/reason for the notice and you may delete these emails. | |||
= "IMPORTANT REMINDER - SYSTEM SHUT DOWN TODAY" = | |||
Every night at midnight, a script ([[Infrastructure_Machines#Cronjobs|shutdownreminder.pl]]) runs which looks for systems (in [[Management_System_/_Public_Website_/_Signup_/_Account_Manager#jc:systems|jc:systems]]) which are marked with a <tt>cancel_date</tt> (but no stop date – i.e. they’re pending shutdown). If the cancel date is equal to the current date or current date + 10, 5, 2, or 1 days, it will send them a shutdown reminder email (cc’ing support@johncompanies.com). All emails are logged so there’s no need to save these cc’s. Typically, we keep the “shutdown today” reminder in the queue as a reminder that it needs to be shut down (manually, typically at the end of the day). The list of systems set to be shutdown is also listed on the main page in Mgmt. | |||
A series of these go out at the cancel date gets closer, until the day of the cancel at which point the subject switches from "SHUT DOWN IN X DAYS" to "SHUT DOWN TODAY" | |||
'''Action to take:''' delete these emails, but when you get the notice about "SHUT DOWN TODAY", that is your reminder to actually stop the system today. | |||
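For reference, the dates the script matches against work out like this (a minimal sketch of the date arithmetic, assuming GNU date; the real check is done by shutdownreminder.pl against jc:systems):
<pre>
# print the cancel_date values that would trigger a reminder today (sketch)
for days in 10 5 2 1 0; do
    date -d "+${days} days" +%Y-%m-%d
done
</pre>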
= "RAID controller problem on backup1.johncompanies.com" = | = "RAID controller problem on backup1.johncompanies.com" = | ||
Line 43: | Line 67: | ||
You get this when the cronjob running on the server notices there's been a new event in the logs, and those recent events are include. | You get this when the cronjob running on the server notices there's been a new event in the logs, and those recent events are include. | ||
'''Action to take:''' Review the logs on the server. See commands above and review [[tw_cli_Reference]]. Optional: clear the warning with <tt>3wraidchk shh</tt> on the server that generated the notice. | '''Action to take:''' Review the logs on the server. See commands above and review [[tw_cli_Reference]]. Optional: clear the warning with <tt>3wraidchk shh</tt> on the server that generated the notice. Delete email. | ||
= "bwdb2: sendsql.pl error" = | |||
<pre>scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost | |||
connection)</pre> | |||
'''Action to take:''' Usually none, this is a temporary failure to transfer bandwidth stats from bwdb2 to backup1 where the database lives that tracks and supplies all b/w stats. The script will continue to attempt to send data over. Only take action if it continues to fail without obvious reason (i.e. temporary outage i2b<->castle or backup1 down) | |||
= "replication failing" = | |||
replication behind by 0 days, 12 hours, 23 minutes | |||
This message serves as a warning that the mysql replication between bwdb and backup1 is not in sync any longer. This could be the result of 1 of 2 things: | |||
1. replication really isn't happening anymore | |||
2. there is a backlog of netflow(s) that bwdb is working it's way through. That's why the latest data in the DB (which is what this notification looks for- stale data) is from hours ago. | |||
To rule out #2, on bwdb simply look at the queue of flows to be processed: | |||
bwdb /home/flowbin# ls /usr/home/working | |||
ft-v05.2012-11-05.113000-0800 | |||
bwdb /home/flowbin# | |||
You see there's 1 flow there ready to processes. Had there been a backlog you'd see lots of files there. So this message must mean there's a replication failure. Here's how to look at the replication status: | |||
On bwdb: | |||
<pre>mysql> SHOW MASTER STATUS;
+-----------------+-----------+--------------+------------------+
| File            | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+-----------+--------------+------------------+
| bwdb-bin.001551 | 156935084 |              |                  |
+-----------------+-----------+--------------+------------------+
1 row in set (0.02 sec)</pre>
On backup1:
<pre>mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001550
Read_Master_Log_Pos: 505039281
Relay_Log_File: mysqld-relay-bin.001527
Relay_Log_Pos: 98
Relay_Master_Log_File: bwdb-bin.001550
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 505039281
Relay_Log_Space: 98
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.57 sec)
</pre>
Our indicators that something is wrong come from looking at these fields from backup1:
Read_Master_Log_Pos: 505039281
This doesn't match up with what bwdb shows for log position: <tt>156935084</tt>
Relay_Master_Log_File: bwdb-bin.001550
This doesn't match up with what bwdb shows for log file: <tt>bwdb-bin.001551</tt>
Slave_IO_Running: No
This should say <tt>Yes</tt>
Here's how this is resolved:
<pre>
mysql> stop slave;
mysql> reset slave;
mysql> start slave;
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001549
Read_Master_Log_Pos: 236696828
Relay_Log_File: mysqld-relay-bin.000003
Relay_Log_Pos: 133842
Relay_Master_Log_File: bwdb-bin.001549
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 132576
Relay_Log_Space: 238358016
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 343413</pre>
I'm a little perplexed why it reverted to log file <tt>bwdb-bin.001549</tt>, but we see that <tt>Slave_IO_Running</tt> and <tt>Slave_SQL_Running</tt> are Yes, and if you re-run the show slave status command you will see that the <tt>Read_Master_Log_Pos</tt> is incrementing. However, after running through all of log file <tt>bwdb-bin.001550</tt>, it eventually fails:
<pre>
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001550
Read_Master_Log_Pos: 504660170
Relay_Log_File: mysqld-relay-bin.000006
Relay_Log_Pos: 508191412
Relay_Master_Log_File: bwdb-bin.001550
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 1064
Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: ''. Query: ''
Skip_Counter: 0
Exec_Master_Log_Pos: 504659938
Relay_Log_Space: 508191646
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
</pre>
Probably because it got to a corrupted part of the log file. After looking at [http://www.zmonkey.org/blog/node/157 http://www.zmonkey.org/blog/node/157] it looks like I may need to skip over a bad part in the log and go to the next entry for the next log position. However, after performing this on bwdb:
<pre>
bwdb /usr/home/database# mysqlbinlog bwdb-bin.001550 -j 504659938
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 504659938
#121104 20:04:18 server id 1 log_pos 504659938 Query thread_id=15838 exec_time=0 error_code=0
SET TIMESTAMP=1352088258;
;
ERROR: Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
Could not read entry at offset 504660170:Error in log format or read error
bwdb /usr/home/database#
</pre>
This may be an indication that it's simply the end of the log file (there's no more output after it), so we look at the next logfile:
<pre>
bwdb /usr/home/database# mysqlbinlog bwdb-bin.001551|more
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 4
#121104 23:04:07 server id 1 log_pos 4 Start: binlog v 3, server v 4.1.10a-log creat
ed 121104 23:04:07 at startup
# at 79
#121104 23:18:05 server id 1 log_pos 79 Query thread_id=2 exec_time=0 error
_code=0
use traffic;
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239101-20121104-1', '2012-11-04 19:45:00', '
69.55.239.101', 1, 344, 7) ON DUPLICATE KEY UPDATE octets=octets+344, packets=packets+7;
# at 297
#121104 23:18:05 server id 1 log_pos 297 Query thread_id=2 exec_time=0 error
_code=0
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239102-20121104-1', '2012-11-04 19:45:00', '
69.55.239.102', 1, 84, 2) ON DUPLICATE KEY UPDATE octets=octets+84, packets=packets+2;
</pre>
So it looks like this is what we want, and the timestamp correlates with when the server came back online (from a crash in this instance). So we reset the slave to pull from the other log:
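The exact statement used here wasn't captured in these notes, but pointing the slave at the start of the next binlog would look roughly like this (a sketch, using the same CHANGE MASTER syntax shown in Plan B below; position 4 is the start of a binlog):
<pre>
mysql> stop slave;
mysql> CHANGE MASTER TO MASTER_LOG_FILE='bwdb-bin.001551', MASTER_LOG_POS=4;
mysql> start slave;
</pre>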
<pre>
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001551
Read_Master_Log_Pos: 45486292
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 140
Relay_Master_Log_File: bwdb-bin.001551
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 4
Relay_Log_Space: 45805720
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 67238
1 row in set (0.22 sec)
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001551
Read_Master_Log_Pos: 107339149
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 464584
Relay_Master_Log_File: bwdb-bin.001551
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 460960
Relay_Log_Space: 108092633
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 55577
1 row in set (0.00 sec)
</pre>
And if we continue to check the status we see that the position is moving and the <tt>Seconds_Behind_Master</tt> is dropping. So we should be good.
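Rather than re-typing the command, something like this can keep an eye on it (a sketch; assumes the mysql client on backup1 can log in without prompting, e.g. credentials in ~/.my.cnf, otherwise add -u/-p):
<pre>
backup1# while true; do mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Log_File|Log_Pos|Running|Seconds_Behind'; sleep 30; done
</pre>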
After a bit we got:
<pre>mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001551
Read_Master_Log_Pos: 196122754
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 1569496
Relay_Master_Log_File: bwdb-bin.001551
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error: Query caused different errors on master and slave. Error on master: 'Incorrect key file for table '%-.200s'; try to repair it' (1034), Error on slave: 'no error' (0). Default database: 'traffic'. Query: 'insert into portTotals_69_55_225 values ('2012-11-04 19:45:00', '69.55.225.225', 1, 6, 99999, 100, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 1, 0, 12414, 240.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 53, 15156829.5, 64580.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 1024, 387, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 361, 213, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 833, 15120, 60),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 395, 10534, 46),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 559, 31968, 111),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 53, 1541, 24),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 80, 80, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 81, 40, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 99999, 120, 3),('2012-11-04 19:45:00', '
Skip_Counter: 0
Exec_Master_Log_Pos: 1558174
Relay_Log_Space: 197497274
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
</pre>
So on bwdb we did:
<pre>mysql> repair table portTotals_69_55_225;</pre>
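If more than one table turns out to be damaged, mysqlcheck can check and repair the whole traffic database in one pass (a sketch of an alternative approach; not what was run here):
<pre>
bwdb# mysqlcheck -p --check traffic
bwdb# mysqlcheck -p --repair traffic
</pre>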
Then on backup1, after we stop, reset, and start the slave, it eventually bombs again:
<pre>
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001553
Read_Master_Log_Pos: 99377008
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 1569496
Relay_Master_Log_File: bwdb-bin.001551
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error: Query caused different errors on master and slave. Error on master: 'Incorrect key file for table '%-.200s'; try to repair it' (1034), Error on slave: 'no error' (0). Default database: 'traffic'. Query: 'insert into portTotals_69_55_225 values ('2012-11-04 19:45:00', '69.55.225.225', 1, 6, 99999, 100, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 1, 0, 12414, 240.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 53, 15156829.5, 64580.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 1024, 387, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 361, 213, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 833, 15120, 60),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 395, 10534, 46),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 559, 31968, 111),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 53, 1541, 24),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 80, 80, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 81, 40, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 99999, 120, 3),('2012-11-04 19:45:00', '
Skip_Counter: 0
Exec_Master_Log_Pos: 1558174
Relay_Log_Space: 298937975
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
</pre>
So at this point we revert to the poor man's method: Plan B.
What we do here is delete all data in both databases, in all tables, from the day of the outage forward, then restart the sync. Then we re-process the flow files on bwdb from the day of the outage; given enough time to catch up, it will rebuild the database (which will properly replicate as the processing continues). So, given an outage on 11/4/12 with today being the 6th:
1. comment out the processing script temporarily:
<pre>
#process flow data
#2,17,32,47 * * * * /usr/home/flowbin/processflows.pl
</pre>
2. create an SQL file with all the commands we want to run:
<pre>
cat rmdata.sql
delete from ipTotals_69_55_224 where date >='2012-11-04';
delete from ipTotals_69_55_225 where date >='2012-11-04';
delete from ipTotals_69_55_226 where date >='2012-11-04';
delete from ipTotals_69_55_227 where date >='2012-11-04';
delete from ipTotals_69_55_228 where date >='2012-11-04';
delete from ipTotals_69_55_230 where date >='2012-11-04';
delete from ipTotals_69_55_232 where date >='2012-11-04';
delete from ipTotals_69_55_233 where date >='2012-11-04';
delete from ipTotals_69_55_234 where date >='2012-11-04';
delete from ipTotals_69_55_235 where date >='2012-11-04';
delete from ipTotals_69_55_236 where date >='2012-11-04';
delete from ipTotals_69_55_237 where date >='2012-11-04';
delete from ipTotals_69_55_238 where date >='2012-11-04';
delete from ipTotals_69_55_239 where date >='2012-11-04';
delete from dailyIpTotals_69_55_224 where date >='2012-11-04';
delete from dailyIpTotals_69_55_225 where date >='2012-11-04';
delete from dailyIpTotals_69_55_226 where date >='2012-11-04';
delete from dailyIpTotals_69_55_227 where date >='2012-11-04';
delete from dailyIpTotals_69_55_228 where date >='2012-11-04';
delete from dailyIpTotals_69_55_230 where date >='2012-11-04';
delete from dailyIpTotals_69_55_232 where date >='2012-11-04';
delete from dailyIpTotals_69_55_233 where date >='2012-11-04';
delete from dailyIpTotals_69_55_234 where date >='2012-11-04';
delete from dailyIpTotals_69_55_235 where date >='2012-11-04';
delete from dailyIpTotals_69_55_236 where date >='2012-11-04';
delete from dailyIpTotals_69_55_237 where date >='2012-11-04';
delete from dailyIpTotals_69_55_238 where date >='2012-11-04';
delete from dailyIpTotals_69_55_239 where date >='2012-11-04';
delete from portTotals_69_55_224 where date >='2012-11-04';
delete from portTotals_69_55_225 where date >='2012-11-04';
delete from portTotals_69_55_226 where date >='2012-11-04';
delete from portTotals_69_55_227 where date >='2012-11-04';
delete from portTotals_69_55_228 where date >='2012-11-04';
delete from portTotals_69_55_230 where date >='2012-11-04';
delete from portTotals_69_55_232 where date >='2012-11-04';
delete from portTotals_69_55_233 where date >='2012-11-04';
delete from portTotals_69_55_234 where date >='2012-11-04';
delete from portTotals_69_55_235 where date >='2012-11-04';
delete from portTotals_69_55_236 where date >='2012-11-04';
delete from portTotals_69_55_237 where date >='2012-11-04';
delete from portTotals_69_55_238 where date >='2012-11-04';
delete from portTotals_69_55_239 where date >='2012-11-04';
delete from dailyPortTotals_69_55_224 where date >='2012-11-04';
delete from dailyPortTotals_69_55_225 where date >='2012-11-04';
delete from dailyPortTotals_69_55_226 where date >='2012-11-04';
delete from dailyPortTotals_69_55_227 where date >='2012-11-04';
delete from dailyPortTotals_69_55_228 where date >='2012-11-04';
delete from dailyPortTotals_69_55_230 where date >='2012-11-04';
delete from dailyPortTotals_69_55_232 where date >='2012-11-04';
delete from dailyPortTotals_69_55_233 where date >='2012-11-04';
delete from dailyPortTotals_69_55_234 where date >='2012-11-04';
delete from dailyPortTotals_69_55_235 where date >='2012-11-04';
delete from dailyPortTotals_69_55_236 where date >='2012-11-04';
delete from dailyPortTotals_69_55_237 where date >='2012-11-04';
delete from dailyPortTotals_69_55_238 where date >='2012-11-04';
delete from dailyPortTotals_69_55_239 where date >='2012-11-04';
</pre>
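Rather than typing the file out by hand, a short loop can generate it (a sketch; assumes bash and the same four table prefixes and /24 octets listed above):
<pre>
for prefix in ipTotals dailyIpTotals portTotals dailyPortTotals; do
  for net in 224 225 226 227 228 230 232 233 234 235 236 237 238 239; do
    echo "delete from ${prefix}_69_55_${net} where date >='2012-11-04';"
  done
done > rmdata.sql
</pre>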
Run it: <tt>mysql -p traffic < rmdata.sql</tt> on bwdb AND backup1 (this will take some time to complete).
3. replace all the flow files for re-processing:
<pre>cp /usr/home/archive/ft-v05.2012-11-04.* /usr/home/working/
cp /usr/home/archive/ft-v05.2012-11-05.* /usr/home/working/
cp /usr/home/archive/ft-v05.2012-11-06.* /usr/home/working/
</pre>
Look for any temporary files from the time of the crash; move and rename them:
<pre>
mv /usr/home/flows/2012-11/2012-11-04/tmp-v05.2012-11-04.200000-0800 /usr/home/archive/ft-v05.2012-11-04.200000-0800
</pre>
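To find any leftover temporary capture files from around the time of the crash (a sketch; assumes the flow directory layout shown above):
<pre>
bwdb# find /usr/home/flows -name 'tmp-v05.*'
</pre>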
4. restart slaving
On bwdb:
<pre>
mysql> show master status;
+-----------------+----------+--------------+------------------+
| File            | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+----------+--------------+------------------+
| bwdb-bin.001557 | 5612     |              |                  |
+-----------------+----------+--------------+------------------+
1 row in set (0.00 sec)
</pre>
On backup1:
<pre>
mysql> stop slave;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> CHANGE MASTER TO MASTER_HOST='10.1.4.203',MASTER_USER='repl',MASTER_PASSWORD='qERUG8wf',MASTER_LOG_FILE='bwdb-bin.001557',MASTER_LOG_POS=5612;
Query OK, 0 rows affected (1.35 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.1.4.203
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bwdb-bin.001557
Read_Master_Log_Pos: 5612
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 140
Relay_Master_Log_File: bwdb-bin.001557
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 5612
Relay_Log_Space: 140
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
1 row in set (0.00 sec)
</pre>
5. on bwdb re-process the flows: <tt>/usr/home/flowbin/processflows.pl</tt>
Shortly thereafter you should see the <tt>Relay_Master_Log_File</tt> on backup1 increase as it follows bwdb's replication. You should also see the list of flow files in the queue start to shrink (at least the ones from days ago will start to disappear).
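A quick way to watch the queue drain on bwdb while it reprocesses (a sketch):
<pre>
bwdb# while true; do ls /usr/home/working | wc -l; sleep 60; done
</pre>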
6. once <tt>processflows.pl</tt> is done, uncomment the cronjob to let it start running on its own.
= "backup1: processsql.pl error" = | |||
<pre>bzcat /data/bwdb2/pending/2012-11-09-04:30.sql.bz2.done | mysql -u bwdb2 -pxxxxx traffic (ERROR 2013 (HY000): Lost connection to MySQL server at 'sending authentication information', system error: 32)</pre> | |||
This is the result of the mysql server not running to accept the data from bwdb2. | |||
'''Action to take:''' None assuming you get just a couple/few of these, and it was due to an explained outage. | |||
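A quick way to confirm mysqld on backup1 is back up and accepting connections before dismissing the notice (a sketch; uses the same bwdb2 user the import runs as):
<pre>
backup1# mysqladmin -u bwdb2 -p ping
mysqld is alive
</pre>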
= "bwdb2: sendsql.pl error" = | |||
<pre>scp -Cq /usr/home/sql/2012-11-08-16:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost connection)</pre> | |||
This is usually the result of a network disruption or outage on backup1/bwdb2 causing the bandwidth file to not be transferred to backup1 to be imported into the traffic database. This is nothing to be too concerned about, it should retry this transfer on the next cron run. | |||
'''Action to take:''' None if you get just a couple/few of these, and it was due to an explained outage. | |||
= "RAID controller problem on newbackup3" = | |||
1 ARC-1160-VOL#00 Raid Set # 00 Raid5 5000.0GB 00/00/00 Checking(77%) | |||
'''Action to take:''' None, assuming you initiated the verify. | |||
= "Bandwidth limit notification" = | |||
A process on mail ([[Infrastructure_Machines#Cronjobs|top20customer.pl]]) periodically looks for customers who are nearing or at their bandwidth limit (which varies based on the package and # of systems), and emails support@johncompanies.com & sales@johncompanies.com when they’re getting close. Customers have 3 options for how to handle bandwidth overage: | |||
# stop – they want traffic to be stopped (we cap, not stop actually) when they reach their allotment | |||
# limit – they want to be billed for extra traffic, up to a certain amount | |||
# go – they want to be billed for any amount of extra traffic | |||
There are 6 conditions which are met that will generate this email notice, 3 conditions are warnings (when they’re within 10% of the next action/level) and 3 are limits – when they’ve reached the limt/level. | |||
So, given option 1, a warning will come at 90% ( “warn (cap)”), and again when all bandwidth is used ( “limit (cap)”), | |||
Given option 2, a warning will come at 90% (“warn (limit)”), and again when all bandwidth and authorized overage is used (“limit (limit)”), | |||
Given option 3. a warning will come at 90% (“warn (go)”) and again when all bandwidth is used (“limit (go)”). | |||
When the email is sent to support about the bandwidth warning: | |||
<pre>Subject: Bandwidth limit notification | |||
customer # usage level b/w option | |||
col00688 1 45.05G warn limit to $40</pre> | |||
go to the customer’s page (for each customer listed in the email) and click the “send notice” link- a customized email will be sent to the contacts listed for the account (cc’d to payments and support) indicating what the situation is. | |||
After someone has reached the end of their purchased or included bandwidth, they should be stopped, but we don’t do that – we cap them at 256k. Linux customers are easily capped using bwcap, everyone else is capped in the firewall. Linux customers will automatically uncap on the first of the month, firewall capped customers must be manually uncapped. | |||
On the first of the month a bandwidth report is sent out to support detailing who should be billed for overage. | |||
There are pastes containing the text that’s automatically sent to customers when you click “send notice” (Mgmt -> pastes). | |||
'''Action to take:''' After reviewing whether to contact the customer or limit bandwidth, you can send them a notice via the 'Send Notice' link in the bandwidth section of their mgmt page. The notice will be tailored to the current situation (warning vs over). The email can be deleted. | |||
= "Bandwidth Overage Report for March 2010" = | |||
<pre>customer avail. usage b/w option | |||
col00393 250 464.37 go/bill | |||
col01556 200 261.28 go/bill | |||
customer overage charge | |||
col00393 214.37 $160.7775 | |||
col01556 61.28 $45.96 | |||
Total collection amount: $206.7375</pre> | |||
'''Action to take:''' Use this report to decide if and how much to bill a customer for bandwidth usage. Save/archive the report email. | |||
= Backup failures =
<pre>jail1 at 2012-12-12 : rsync.backup attempted to backup /mnt/data1/69.55.233.233-col01243/asd but that directory is non-existent.</pre>
This is an automated email sent out by the vps backup scripts when they run into a problem. It means a directory was removed by a customer, or perhaps the vps was cancelled and its directories were not removed from the nightly backup script, or perhaps we moved their vps mount point/location and didn't update the backup config.
'''Action to take:''' Rectify the problem by removing or renaming the dir to be backed up (contact the customer if the cause was not a move, cancellation, or other action on our part). Delete the email notice.