Bandwidth Management


TODO

Finding who's causing a bandwidth spike

We find out about bandwidth usage spikes in one of several ways:

  • the NOC calls and tells us they've noticed a large usage spike
  • we see a system-generated email telling us a customer has exceeded their usage cap
  • speed complaints start coming in
  • we notice the spike on the mrtg page

Determining the cause of a spike is fairly easy with a bit of digging.

Castle: Open up the mrtg graph for p1a (the top-level switch for most of the machines at castle): mgmt -> monitoring -> p1a -> bytes/sec

i2b: Open up the mrtg graph for p20 (the top-level switch for most of the machines at i2b): mgmt -> monitoring -> p20

From there, you can narrow down which switch the spike is coming from, then load the mrtg graph for that switch and narrow down further by port/device. A word of caution: even though the mrtg graphs are labeled to indicate which device is connected to which port, you should take followup steps to confirm which machine is actually on that port (the labeling should be accurate for the 3750, p1a, p1b, and p20, and the switches at i2b are mostly labeled correctly, but don't trust the rest). See Finding which IPs are on a port
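
The details are on that page, but as a rough illustration: on a Cisco switch such as the 3750, you can list the MAC addresses learned on a port and then map a MAC to an IP on the router (the interface and MAC below are hypothetical):

show mac address-table interface GigabitEthernet1/0/12
show ip arp | include aaaa.bbbb.cccc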

The most appropriate action is to surgically limit the offending IP/VPS/server from transmitting at the high rate (rather than cutting off the entire port/server). Your action will depend on what is connected to the port:


Virt/jail:

  • you can look at the host machine and try to determine who/what is causing the spike; look for processes consuming lots of CPU (run jt and vwe)
  • you could ask the NOC to tell you the top talkers; you may be able to narrow down which IPs they should look for based on the IPs in use on the device
  • on virts you can run vznetstat and take differentials to see which counter is rising fastest (see the sketch after this list); you may also see network usage or latency figures in vzstat
  • on freebsd you can broadly assign a per-IP bandwidth cap:
# pipe 2: limit each source IP (mask src-ip 0xffffffff) to 14Mbit/s
ipfw pipe 2 config bw 14Mbit/s mask src-ip 0xffffffff
# rule 1: send everything leaving this host via bge1 through pipe 2
ipfw add 1 pipe 2 ip from me to any via bge1

(alter/check the speed and NIC as appropriate: bge0, em1, etc.)
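
To verify the cap is in place, and to remove it once the spike has passed:

ipfw pipe show   # lists pipes and their per-IP dynamic queues
ipfw delete 1    # removes rule 1 (the cap) when you're done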

  • if you find the IP/CT on a virt you can cap:
bwcap <VEID> <kbps>

ex: bwcap 1324 256
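
As referenced in the list above, a minimal sketch of the vznetstat differential (the file names and 10-second interval are arbitrary choices):

vznetstat > /tmp/vznet.1
sleep 10
vznetstat > /tmp/vznet.2
diff /tmp/vznet.1 /tmp/vznet.2

The CT whose counters jumped the most between the two snapshots is your likely suspect; that VEID is what you'd feed to bwcap.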


Customer/colo:

  • your best option is just to cap it in the firewall, as sketched below. Don't forget to set up the cap in the right firewall (firewall @ castle vs. firewall2 @ i2b). See Setting up bandwidth caps
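
A minimal sketch of such a cap, assuming the firewall runs ipfw and that pipe 5 and rule numbers 500/501 are free (69.55.230.50 is a stand-in for the customer's IP; pick a bandwidth to suit):

# both rules share pipe 5, so inbound and outbound together are limited to 2Mbit/s
ipfw pipe 5 config bw 2Mbit/s
ipfw add 500 pipe 5 ip from 69.55.230.50 to any
ipfw add 501 pipe 5 ip from any to 69.55.230.50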


Universal options:

Note about big customers
You should always take care to exclude rsync.net traffic (69.43.165.x) when doing captures (or asking the NOC to take them), as well as our largest customer, col01372, whose main IPs are 69.55.234.230 and 69.55.234.246. Any time you see traffic spikes from those addresses, it's likely just legitimately heavy traffic, not something run amok, and nothing to be capped (unless all network traffic is getting cut off). Note that col01372 has his own switch that doesn't feed into p1a; it connects directly to the network router (the 3750), so you'll have to look at the 3750 mrtg graph to see if there is a spike on his port- it's labeled for col01372.
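
If you're taking the capture yourself, a tcpdump filter along these lines excludes that traffic (bge1 is a stand-in for the uplink interface):

tcpdump -ni bge1 'not net 69.43.165.0/24 and not host 69.55.234.230 and not host 69.55.234.246'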

Caps

If it's happening on or running through a FreeBSD server (i.e. a VPS in a jail or traffic running through our firewalls), see Setting up bandwidth caps

If it's happening on a virt, see bwcap

Reporting

Notices