System-generated Notifications


"snapshot rotation done on backup1"

Sent daily - expect to receive this!

Action to take: Confirm that this was received before midnight (or whenever the virts/jails -> backup server backups kick off). Delete email.

"RAID controller problem on backup1.johncompanies.com"

Ignoring all alarms prior to 2012-09-12-12-36-37
unitu0 drive p1 status= DEVICE-ERROR
there was a WARNING event on 2012-09-14 01:59:39
there was a WARNING event on 2012-09-14 02:08:27
there was a WARNING event on 2012-09-14 03:54:47
there was a WARNING event on 2012-09-15 02:38:14
there was a WARNING event on 2012-09-15 02:59:02
there was a WARNING event on 2012-09-15 04:47:08
there was a WARNING event on 2012-09-15 04:47:31
there was a WARNING event on 2012-09-15 10:41:59
there was a WARNING event on 2012-09-15 13:25:23
there was a WARNING event on 2012-09-15 13:25:31
there was a WARNING event on 2012-09-15 13:25:54
there was a WARNING event on 2012-09-15 17:10:50
there was a WARNING event on 2012-09-18 01:17:18
there was a WARNING event on 2012-09-25 01:56:47
there was a WARNING event on 2012-09-29 02:04:14
there was a WARNING event on 2012-09-29 10:58:39
there was a WARNING event on 2012-09-29 10:59:02
there was a WARNING event on 2012-09-29 11:22:44
there was a WARNING event on 2012-09-29 13:50:48
there was a WARNING event on 2012-09-29 13:51:11
there was a WARNING event on 2012-09-29 13:51:30
there was a WARNING event on 2012-10-01 04:47:24
there was a WARNING event on 2012-10-02 02:00:27
there was a WARNING event on 2012-10-02 02:01:56
there was a WARNING event on 2012-10-02 05:02:31
there was a WARNING event on 2012-10-02 05:04:14
there was a WARNING event on 2012-10-03 01:22:12
there was a WARNING event on 2012-10-04 04:29:22
there was a WARNING event on 2012-10-04 05:10:51
there was a WARNING event on 2012-10-06 19:41:18
there was a WARNING event on 2012-10-08 00:32:06
there was a WARNING event on 2012-10-09 03:51:03
to see all status: tw_cli /c0 show all
to see all alarms: tw_cli show alarms
to silence old alarms: 3wraidchk shh

You get this when the cronjob running on the server notices there's been a new event in the controller logs; those recent events are included in the message.

Action to take: Review the logs on the server. See commands above and review tw_cli_Reference. Optional: clear the warning with 3wraidchk shh on the server that generated the notice. Delete email.
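In practice the review amounts to running the commands quoted at the bottom of the notification, on the server that sent it (backup1 in this example):

 tw_cli /c0 show all     # full controller, unit and drive status
 tw_cli show alarms      # complete alarm history
 3wraidchk shh           # optional: silence the old alarms once reviewed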

"bwdb2: sendsql.pl error"

scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/ (lost connection)

Action to take: Usually none; this is a temporary failure to transfer bandwidth stats from bwdb2 to backup1, where the database that tracks and supplies all b/w stats lives. The script will continue to attempt to send the data over. Only take action if it continues to fail without an obvious reason (e.g. a temporary i2b<->castle outage, or backup1 being down).
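Should a file ever get stuck, the transfer the script performs can be re-run by hand; the command below is just the failing one from the notification (substitute the stuck file's name):

 scp -Cq /usr/home/sql/2012-10-29-10:30.sql.bz2 backup1:/data/bwdb2/pending/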

"replication failing"

replication behind by 0 days, 12 hours, 23 minutes

This message serves as a warning that the MySQL replication between bwdb and backup1 is no longer in sync. This could be the result of one of two things:

1. replication really isn't happening anymore

2. there is a backlog of netflow files that bwdb is working its way through. In that case the latest data in the DB (stale data is exactly what this notification looks for) is simply from hours ago.

To rule out #2, on bwdb simply look at the queue of flows to be processed:

bwdb /home/flowbin# ls /usr/home/working
ft-v05.2012-11-05.113000-0800
bwdb /home/flowbin#

You can see there's one flow there, ready to process. Had there been a backlog you'd see lots of files there. So this message must mean there's a replication failure. Here's how to look at the replication status:

On bwdb:

mysql> SHOW MASTER STATUS;
+-----------------+-----------+--------------+------------------+
| File            | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+-----------+--------------+------------------+
| bwdb-bin.001551 | 156935084 |              |                  |
+-----------------+-----------+--------------+------------------+
1 row in set (0.02 sec)

On backup1:

mysql> show slave status\G

*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001550
        Read_Master_Log_Pos: 505039281
             Relay_Log_File: mysqld-relay-bin.001527
              Relay_Log_Pos: 98
      Relay_Master_Log_File: bwdb-bin.001550
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 505039281
            Relay_Log_Space: 98
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.57 sec)

Our indicators that something is wrong come from looking at these fields from backup1:

Read_Master_Log_Pos: 505039281

This doesn't match up with what bwdb shows for log position: 156935084

Relay_Master_Log_File: bwdb-bin.001550 

This doesn't match up with what bwdb shows for log file: bwdb-bin.001551

Slave_IO_Running: No

This should say Yes
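A quick way to pull just these fields side by side, without an interactive session (a sketch; it assumes the mysql client can log in non-interactively on both hosts):

 # on bwdb (the master)
 mysql -e 'SHOW MASTER STATUS\G' | egrep 'File|Position'
 # on backup1 (the slave)
 mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Master_Log_File|Read_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running'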

Here's how this is resolved:

 mysql> stop slave;
 mysql> reset slave;
 mysql> start slave;
 mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Queueing master event to the relay log
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001549
        Read_Master_Log_Pos: 236696828
             Relay_Log_File: mysqld-relay-bin.000003
              Relay_Log_Pos: 133842
      Relay_Master_Log_File: bwdb-bin.001549
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 132576
            Relay_Log_Space: 238358016
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 343413

I'm a little perplexed why it reverted to log file bwdb-bin.001549 (possibly because reset slave discards the relay logs and the slave's saved coordinates), but we see that Slave_IO_Running and Slave_SQL_Running are Yes, and if you re-run the show slave status command you will see that Read_Master_Log_Pos is incrementing. However, after running through all of log file bwdb-bin.001550, it eventually fails:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001550
        Read_Master_Log_Pos: 504660170
             Relay_Log_File: mysqld-relay-bin.000006
              Relay_Log_Pos: 508191412
      Relay_Master_Log_File: bwdb-bin.001550
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 1064
                 Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: ''. Query: ''
               Skip_Counter: 0
        Exec_Master_Log_Pos: 504659938
            Relay_Log_Space: 508191646
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

Probably because it got to a corrupted part of the log file. After looking at http://www.zmonkey.org/blog/node/157 it looks like I may need to skip over the bad part of the log and pick up at the next entry's log position. However, after performing this on bwdb:

bwdb /usr/home/database# mysqlbinlog bwdb-bin.001550 -j 504659938
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 504659938
#121104 20:04:18 server id 1  log_pos 504659938         Query   thread_id=15838 exec_time=0 error_code=0
SET TIMESTAMP=1352088258;
;
ERROR: Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
Could not read entry at offset 504660170:Error in log format or read error
bwdb /usr/home/database#

This may be an indication that we've simply hit the end of the log file (there's no more output after it), so we look at the next log file:

bwdb /usr/home/database# mysqlbinlog bwdb-bin.001551|more
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 4
#121104 23:04:07 server id 1  log_pos 4         Start: binlog v 3, server v 4.1.10a-log created 121104 23:04:07 at startup
# at 79
#121104 23:18:05 server id 1  log_pos 79        Query   thread_id=2     exec_time=0     error_code=0
use traffic;
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239101-20121104-1', '2012-11-04 19:45:00', '69.55.239.101', 1, 344, 7) ON DUPLICATE KEY UPDATE octets=octets+344, packets=packets+7;
# at 297
#121104 23:18:05 server id 1  log_pos 297       Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1352099885;
insert into dailyIpTotals_69_55_239 values ('6955239102-20121104-1', '2012-11-04 19:45:00', '69.55.239.102', 1, 84, 2) ON DUPLICATE KEY UPDATE octets=octets+84, packets=packets+2;

So it looks like this is what we want, and the timestamp correlates with when the server came back online (from a crash, in this instance). So we reset the slave to pull from the new log.
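The exact statement wasn't captured here, but presumably something like the following was run on backup1 (a sketch, not the verbatim command; the file and position come from the mysqlbinlog output above, and the Master_Log_File / Exec_Master_Log_Pos values in the status below are consistent with it):

 mysql> stop slave;
 mysql> change master to master_log_file='bwdb-bin.001551', master_log_pos=4;
 mysql> start slave;

Checking the status afterwards: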

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Queueing master event to the relay log
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001551
        Read_Master_Log_Pos: 45486292
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 140
      Relay_Master_Log_File: bwdb-bin.001551
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 4
            Relay_Log_Space: 45805720
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 67238
1 row in set (0.22 sec)

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001551
        Read_Master_Log_Pos: 107339149
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 464584
      Relay_Master_Log_File: bwdb-bin.001551
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 460960
            Relay_Log_Space: 108092633
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: 55577
1 row in set (0.00 sec)

And if we continue to show the status, we see that the position is moving and Seconds_Behind_Master is dropping. So we should be good.
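Rather than retyping the query, a simple poll loop can watch it catch up (a sketch; assumes non-interactive mysql access on backup1):

 while true; do
   mysql -e "show slave status\G" | egrep 'Read_Master_Log_Pos|Seconds_Behind_Master'
   sleep 30
 done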

After a bit we got:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001551
        Read_Master_Log_Pos: 196122754
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 1569496
      Relay_Master_Log_File: bwdb-bin.001551
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error: Query caused different errors on master and slave. Error on master: 'Incorrect key file for table '%-.200s'; try to repair it' (1034), Error on slave: 'no error' (0). Default database: 'traffic'. Query: 'insert into portTotals_69_55_225 values ('2012-11-04 19:45:00', '69.55.225.225', 1, 6, 99999, 100, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 1, 0, 12414, 240.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 53, 15156829.5, 64580.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 1024, 387, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 361, 213, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 833, 15120, 60),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 395, 10534, 46),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 559, 31968, 111),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 53, 1541, 24),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 80, 80, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 81, 40, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 99999, 120, 3),('2012-11-04 19:45:00', '
               Skip_Counter: 0
        Exec_Master_Log_Pos: 1558174
            Relay_Log_Space: 197497274
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

So on bwdb we did:

mysql> repair table portTotals_69_55_225;
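The master-side error 1034 ('Incorrect key file ... try to repair it') is the classic sign of a crashed table index, hence the repair on the master. An optional sanity check afterwards:

 mysql> check table portTotals_69_55_225;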

Then on backup1, after we stop, reset, and start the slave, it eventually bombs again:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 10.1.4.203
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bwdb-bin.001553
        Read_Master_Log_Pos: 99377008
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 1569496
      Relay_Master_Log_File: bwdb-bin.001551
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: traffic.%
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error: Query caused different errors on master and slave. Error on master: 'Incorrect key file for table '%-.200s'; try to repair it' (1034), Error on slave: 'no error' (0). Default database: 'traffic'. Query: 'insert into portTotals_69_55_225 values ('2012-11-04 19:45:00', '69.55.225.225', 1, 6, 99999, 100, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 1, 0, 12414, 240.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 53, 15156829.5, 64580.5),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 1024, 387, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 361, 213, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 833, 15120, 60),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 395, 10534, 46),('2012-11-04 19:45:00', '69.55.225.225', 2, 17, 559, 31968, 111),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 53, 1541, 24),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 80, 80, 2),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 81, 40, 1),('2012-11-04 19:45:00', '69.55.225.225', 2, 6, 99999, 120, 3),('2012-11-04 19:45:00', '
               Skip_Counter: 0
        Exec_Master_Log_Pos: 1558174
            Relay_Log_Space: 298937975
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

So at this point we revert to the poor man's method: Plan B.

What we do here is delete all data in all tables in both databases from the day of the outage forward, restart the sync, then re-process the flow files on bwdb from the day of the outage; given enough time the re-processing will catch up and rebuild the database (which will properly replicate as the processing continues). So, given an outage on 11/4/12, with today being the 6th:

  1. comment out the processing script temporarily:
#process flow data
#2,17,32,47 * * * * /usr/home/flowbin/processflows.pl
  2. we create an sql file with all the commands we want to run:
cat rmdata.sql
delete from ipTotals_69_55_224 where date >='2012-11-04';
delete from ipTotals_69_55_225 where date >='2012-11-04';
delete from ipTotals_69_55_226 where date >='2012-11-04';
delete from ipTotals_69_55_227 where date >='2012-11-04';
delete from ipTotals_69_55_228 where date >='2012-11-04';
delete from ipTotals_69_55_230 where date >='2012-11-04';
delete from ipTotals_69_55_232 where date >='2012-11-04';
delete from ipTotals_69_55_233 where date >='2012-11-04';
delete from ipTotals_69_55_234 where date >='2012-11-04';
delete from ipTotals_69_55_235 where date >='2012-11-04';
delete from ipTotals_69_55_236 where date >='2012-11-04';
delete from ipTotals_69_55_237 where date >='2012-11-04';
delete from ipTotals_69_55_238 where date >='2012-11-04';
delete from ipTotals_69_55_239 where date >='2012-11-04';

delete from dailyIpTotals_69_55_224 where date >='2012-11-04';
delete from dailyIpTotals_69_55_225 where date >='2012-11-04';
delete from dailyIpTotals_69_55_226 where date >='2012-11-04';
delete from dailyIpTotals_69_55_227 where date >='2012-11-04';
delete from dailyIpTotals_69_55_228 where date >='2012-11-04';
delete from dailyIpTotals_69_55_230 where date >='2012-11-04';
delete from dailyIpTotals_69_55_232 where date >='2012-11-04';
delete from dailyIpTotals_69_55_233 where date >='2012-11-04';
delete from dailyIpTotals_69_55_234 where date >='2012-11-04';
delete from dailyIpTotals_69_55_235 where date >='2012-11-04';
delete from dailyIpTotals_69_55_236 where date >='2012-11-04';
delete from dailyIpTotals_69_55_237 where date >='2012-11-04';
delete from dailyIpTotals_69_55_238 where date >='2012-11-04';
delete from dailyIpTotals_69_55_239 where date >='2012-11-04';

delete from portTotals_69_55_224 where date >='2012-11-04';
delete from portTotals_69_55_225 where date >='2012-11-04';
delete from portTotals_69_55_226 where date >='2012-11-04';
delete from portTotals_69_55_227 where date >='2012-11-04';
delete from portTotals_69_55_228 where date >='2012-11-04';
delete from portTotals_69_55_230 where date >='2012-11-04';
delete from portTotals_69_55_232 where date >='2012-11-04';
delete from portTotals_69_55_233 where date >='2012-11-04';
delete from portTotals_69_55_234 where date >='2012-11-04';
delete from portTotals_69_55_235 where date >='2012-11-04';
delete from portTotals_69_55_236 where date >='2012-11-04';
delete from portTotals_69_55_237 where date >='2012-11-04';
delete from portTotals_69_55_238 where date >='2012-11-04';
delete from portTotals_69_55_239 where date >='2012-11-04';

delete from dailyPortTotals_69_55_224 where date >='2012-11-04';
delete from dailyPortTotals_69_55_225 where date >='2012-11-04';
delete from dailyPortTotals_69_55_226 where date >='2012-11-04';
delete from dailyPortTotals_69_55_227 where date >='2012-11-04';
delete from dailyPortTotals_69_55_228 where date >='2012-11-04';
delete from dailyPortTotals_69_55_230 where date >='2012-11-04';
delete from dailyPortTotals_69_55_232 where date >='2012-11-04';
delete from dailyPortTotals_69_55_233 where date >='2012-11-04';
delete from dailyPortTotals_69_55_234 where date >='2012-11-04';
delete from dailyPortTotals_69_55_235 where date >='2012-11-04';
delete from dailyPortTotals_69_55_236 where date >='2012-11-04';
delete from dailyPortTotals_69_55_237 where date >='2012-11-04';
delete from dailyPortTotals_69_55_238 where date >='2012-11-04';
delete from dailyPortTotals_69_55_239 where date >='2012-11-04';

Run it (mysql -p traffic < rmdata.sql) on bwdb AND backup1; this will take some time to complete. The file can also be generated with a short loop, as sketched below.
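A short shell loop that produces the same rmdata.sql (the octets and table prefixes are taken from the listing above; adjust the date to the day of the outage):

 for net in 224 225 226 227 228 230 232 233 234 235 236 237 238 239; do
   for t in ipTotals dailyIpTotals portTotals dailyPortTotals; do
     echo "delete from ${t}_69_55_${net} where date >='2012-11-04';"
   done
 done > rmdata.sql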

  3. replace all the flow files for re-processing:
cp /usr/home/archive/ft-v05.2012-11-04.* /usr/home/working/
cp /usr/home/archive/ft-v05.2012-11-05.* /usr/home/working/
cp /usr/home/archive/ft-v05.2012-11-06.* /usr/home/working/

Look for any temporary files from the time of the crash, and move/rename them:

mv /usr/home/flows/2012-11/2012-11-04/tmp-v05.2012-11-04.200000-0800  /usr/home/archive/ft-v05.2012-11-04.200000-0800  
  4. restart slaving on backup1 (stop slave; reset slave; start slave; as above)
  5. process the flows: processflows.pl
  6. uncomment the cronjob