Bandwidth Management
TODO
Finding who's causing a bandwidth spike
We find out about bandwidth usage spikes in one of several ways:
- NOC calls and tells us they notice a large usage spike
- we see a system-generated email telling us a customer has exceeded their usage allocation
- speed complaints are coming in
- we notice the spike on the mrtg page
Determining the cause of the spike is fairly easy with a bit of looking.
Castle: Open up the mrtg graph for p1a (the top-level switch for most of the machines at castle): mgmt -> monitoring -> p1a -> bytes/sec
i2b: Open up the mrtg graph for p20 (the top-level switch for most of the machines at i2b): mgmt -> monitoring -> p20
From there, you can narrow down which switch the spike is coming from, then load the mrtg graph for that switch and narrow down further by port/device. Word of caution: even though the mrtg graphs show labels indicating which device is connected to which port, you should take follow-up steps to confirm which machine is actually on that port. The exceptions are the 3750, p1a, p1b and p20, where the labeling should be accurate; in general, the switches at i2b are also mostly correctly labeled. See Finding which IPs are on a port
The most appropriate action is to surgically limit the offending IP/VPS/server from transmitting at the high rate (rather than cut off the entire port/server). Your action will depend on what is connected to the port:
Virt/jail:
- you can look at the host machine and try to determine who/what is causing the spike. Look for processes consuming lots of CPU (run jt and vwe)
- you could ask the NOC to tell you the top talkers. You may be able to narrow down which IPs they should look for based on the IPs in use on the device
- on virts you can run vznetstat and do differentials (compare two runs taken some time apart) to see which counter is rising fastest. You may also see network usage or latency figures in vzstat
- on freebsd you can broadly assign a per-IP bandwidth cap:
ipfw pipe 2 config bw 14Mbit/s mask src-ip 0xffffffff
ipfw add 1 pipe 2 ip from me to any via bge1
(check and alter the speed and NIC to suit, e.g. bge0, em1, etc.)
- if you find the IP/CT on a virt you can bwcap:
bwcap <VEID> <kbps>
ex: bwcap 1324 256
- you can also cap in the firewall or on the host jail (in the case of a VPS on a jail). See Setting up bandwidth caps
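The vznetstat differential mentioned above can be sketched as a small shell helper that compares two saved snapshots and ranks containers by how much a counter grew. This is only a sketch under stated assumptions: the function name counter_delta is hypothetical, the default column positions (ID in column 1, byte counter in column 3) are guesses that must be checked against the vznetstat output on the host, and the counters are assumed to be plain integers rather than abbreviated values such as 484M.

```shell
#!/bin/sh
# counter_delta BEFORE AFTER [ID_COL] [BYTES_COL]
# Hypothetical helper: prints "<id> <delta>" for every ID in AFTER,
# largest delta first. Column defaults are assumptions, not a known format.
#
# Intended use on a virt host (not run here):
#   vznetstat > /tmp/vznetstat.before
#   sleep 60
#   vznetstat > /tmp/vznetstat.after
#   counter_delta /tmp/vznetstat.before /tmp/vznetstat.after
counter_delta() {
    before=$1
    after=$2
    id_col=${3:-1}
    bytes_col=${4:-3}
    awk -v id="$id_col" -v col="$bytes_col" '
        # Skip headers and separators: keep rows whose counter is an integer.
        $col !~ /^[0-9]+$/ { next }
        # First file: remember the earlier counter (summed per ID).
        FILENAME == ARGV[1] { old[$id] += $col; next }
        # Second file: accumulate the later counter.
        { new[$id] += $col }
        # IDs missing from the first file are treated as starting at 0.
        END { for (k in new) printf "%s %.0f\n", k, new[k] - old[k] }
    ' "$before" "$after" | sort -k2,2nr
}
```

The ID at the top of the output is the one whose counter rose fastest between the two snapshots. A negative delta most likely means the counter was reset (for example, the container restarted) between runs.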
Customer/colo:
- your best option is just to cap in the firewall. Don't forget to set up the cap in the right firewall (firewall @ castle vs. firewall2 @ i2b). See Setting up bandwidth caps
Universal options:
- you can cap the port speed within the switch. See Controlling port speed
- you can turn off the port entirely (last resort). See Shutting down a port
Note about big customers
You should always take care to exclude rsync.net traffic (69.43.165.x) when taking captures (or asking the NOC to take them), as well as traffic from our largest customer, col01372, whose main IPs are 69.55.234.230 and 69.55.234.246
Any time you see traffic spikes from those addresses, it's likely just a lot of legitimate traffic and not something running awry, so there is nothing to cap (unless all network traffic is getting cut off). Note that col01372 has his own switch that does not come into p1a; it connects directly to the network router (3750), so you'll have to look at the 3750 mrtg graph to see if there is a spike on his port. It's labeled for col01372.
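If a capture is taken with tcpdump (an assumption; the NOC may use a different tool), the exclusions above could be expressed as a filter along these lines. The helper name build_exclude_filter is hypothetical, and treating 69.43.165.x as the /24 network 69.43.165.0/24 is an assumed reading of the note.

```shell
#!/bin/sh
# Hypothetical helper: build a pcap filter expression that excludes the
# given networks (anything containing "/") and single hosts.
build_exclude_filter() {
    filter=""
    for addr in "$@"; do
        case $addr in
            */*) term="not net $addr" ;;
            *)   term="not host $addr" ;;
        esac
        # Join terms with "and"; the first term gets no prefix.
        filter="${filter:+$filter and }$term"
    done
    printf '%s\n' "$filter"
}

# rsync.net range (assumed to be a /24) plus the two col01372 IPs.
EXCLUDE_FILTER=$(build_exclude_filter 69.43.165.0/24 69.55.234.230 69.55.234.246)
echo "$EXCLUDE_FILTER"

# Example invocation (the interface name em0 is a placeholder):
#   tcpdump -n -i em0 "$EXCLUDE_FILTER"
```

Keeping the addresses as arguments makes it easy to add another known heavy but legitimate source later without rewriting the filter by hand.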
Caps
If it's happening on or running through a FreeBSD server (i.e. a VPS on a jail or traffic running through our firewalls) see Setting up bandwidth caps
If it's happening on a virt, see bwcap
Reporting
You will receive a notice on the 1st of the month that details who went over their allocation: System-generated_Notifications#Bandwidth Overage Report
Any action on that overage is a manual billing task.
Up-to-date bandwidth usage can be seen by visiting the customer's mgmt page. The usage is a summary of all their IPs/servers. To get more information, you may run bandwidth reports by clicking on the "view" link in the Bandwidth table on their customer page in mgmt. You can also reach this page from mgmt -> reference -> bandwidth. From there you can pull up details by individual IP, hostname or systemID, and display daily usage info or more granular info (data in 15min increments). This page also shows the top IPs and customers, updated hourly by cronjob. You can send the graphs generated on this page to the account owner by clicking "send graph to customer"
Even more detailed info can be retrieved via the customer's account manager -> bandwidth page. They can search by protocol, port or get raw data. This is the recommended way to get bandwidth data.
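When comparing the 15min usage data against a cap value, it can help to convert a transfer total into an average rate. The sketch below assumes the figure is in bytes and uses decimal megabits (1 Mbit = 1,000,000 bits); the function name avg_mbit is hypothetical, and the units shown in the mgmt reports should be checked before relying on it.

```shell
#!/bin/sh
# avg_mbit BYTES [SECONDS]
# Hypothetical helper: average rate in Mbit/s for BYTES transferred over
# SECONDS (default 900, i.e. one 15-minute increment).
avg_mbit() {
    awk -v bytes="$1" -v secs="${2:-900}" \
        'BEGIN { printf "%.2f\n", bytes * 8 / secs / 1000000 }'
}

# 1,575,000,000 bytes in one 15-minute increment:
avg_mbit 1575000000   # prints 14.00
```

This is an average over the whole increment, so short bursts inside the 15 minutes can be considerably higher than the figure printed.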