Bandwidth Management
TODO
= Finding who's causing bandwidth spike =
We find out about bandwidth usage spikes in one of several ways:
* NOC calls and tells us they notice a large usage spike
* we see a system-generated email telling us a customer has exceeded their usage allocation
* speed complaints are coming in
* we notice the spike on the mrtg page
Determining the cause of the spike is fairly easy with a bit of looking.
Castle: Open up the mrtg graph for p1a (the top-level switch for most of the machines at castle): mgmt -> monitoring -> p1a -> bytes/sec
i2b: Open up the mrtg graph for p20 (the top-level switch for most of the machines at i2b): mgmt -> monitoring -> p20
From there, you can begin to narrow down which switch the spike is coming from, then load the mrtg graph for that switch and narrow down further by port/device. A word of caution: even though the mrtg graphs show labels indicating which device is connected to which port, you should take follow-up steps to confirm which machine is actually on that port (the labels should be accurate for the 3750, p1a, p1b, and p20, and the switches at i2b are mostly labeled correctly). See [[Switch_Control#Finding_which_IPs_are_on_a_port|Finding which IPs are on a port]]
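As a rough sketch of that confirmation step, assuming a Cisco IOS switch and using placeholder interface/MAC values, you can look up which MAC addresses the switch sees on the port and then match them to IPs in the 3750's ARP table:
show mac address-table interface GigabitEthernet1/0/12
show ip arp | include aabb.ccdd.eeff
(run the first command on the switch and the second on the 3750 with a MAC taken from the first command's output; on older IOS the first command is spelled show mac-address-table)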
The most appropriate action is to surgically limit the offending IP/VPS/server from transmitting at the high rate (rather than cutting off the entire port/server). Your action will depend on what is connected to the port:
Virt/jail:
* you can look at the host machine and try to determine who/what is causing the spike; look for processes consuming lots of CPU (run [[VPS_Management#bwcap|jt]] and [[VPS_Management#bwcap|vwe]])
* you could ask the NOC to tell you the top talkers; you may be able to narrow down which IPs they should look for based on the IPs in use on the device
* on virts you can run <tt>vznetstat</tt> and take differentials to see which counter is rising fastest (see the sketch after this list); you may also see network usage or latency figures in [[VPS_Management#bwcap|vzstat]]
* on freebsd you can broadly assign a per-IP bandwidth cap:
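# dummynet: the mask creates one dynamic 14 Mbit/s pipe per source IP; the rule applies it to traffic from this host's IPs going out via bge1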
ipfw pipe 2 config bw 14Mbit/s mask src-ip 0xffffffff
ipfw add 1 pipe 2 ip from me to any via bge1
(adjust the bandwidth and the NIC name to match the host: bge0, em1, etc.)
* if you find the IP/CT on a virt you can [[VPS_Management#bwcap|bwcap]]:
bwcap <VEID> <kbps>
ex: bwcap 1324 256
* you can also cap in the firewall or on the host jail (in the case of a VPS on a jail). See [[FreeBSD_Reference#Setting_up_bandwidth_caps|Setting up bandwidth caps]]
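A quick way to take the <tt>vznetstat</tt> differential mentioned above (the interval and temp file names here are arbitrary):
vznetstat > /tmp/vznetstat.before
sleep 30
vznetstat > /tmp/vznetstat.after
diff /tmp/vznetstat.before /tmp/vznetstat.after
(the CT whose byte counters jumped the most between the two snapshots is the likely source of the traffic)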
Customer/colo:
* your best option is just to cap in the firewall. Don't forget to set up the cap in the right firewall (firewall @ castle vs. firewall2 @ i2b). See [[FreeBSD_Reference#Setting_up_bandwidth_caps|Setting up bandwidth caps]]
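For illustration only, a firewall cap on a single customer IP might look roughly like the following; the pipe/rule numbers, the rate, and <customer-IP> are placeholders, and the actual procedure is in [[FreeBSD_Reference#Setting_up_bandwidth_caps|Setting up bandwidth caps]]:
ipfw pipe 30 config bw 5Mbit/s
ipfw add 500 pipe 30 ip from <customer-IP> to any
ipfw add 501 pipe 30 ip from any to <customer-IP>
(this caps the IP in both directions; pick rule numbers that fit into the firewall's existing ruleset)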
Universal options:
* you can cap the port speed within the switch. See [[Switch_Control#Controlling_port_speed|Controlling port speed]]
* you can turn off the port entirely (last resort). See [[Switch_Control#Shutting_down_a_port|Shutting down a port]]
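As a rough illustration of the two options above, assuming a Cisco IOS switch and a placeholder interface name (the linked Switch Control pages are the authoritative procedures):
conf t
interface GigabitEthernet1/0/12
speed 10
end
(forces the port down to 10 Mbit/s; issuing shutdown / no shutdown on the same interface disables and re-enables the port entirely)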
Note about big customers<br>
You should always take care to exclude rsync.net traffic (69.43.165.x) when doing captures (or asking the NOC to take them), as well as our largest customer col01372, whose main IPs are 69.55.234.230 and 69.55.234.246.
Any time you see traffic spikes from those addresses, it's likely just a lot of legitimate traffic rather than something gone awry, and nothing should be capped (unless all network traffic is getting cut off). To that end, col01372 has his own switch that does not feed into p1a; it connects directly to the network router (3750), so you'll have to look at the 3750 mrtg graph to see if there is a spike on his port (it's labeled for col01372).
= Caps =


If it's happening on or running through a FreeBSD server (i.e. a VPS on a jail, or traffic running through our firewalls), see [[FreeBSD_Reference#Setting_up_bandwidth_caps|Setting up bandwidth caps]]
 
If it's happening on a virt, see [[VPS_Management#bwcap|bwcap]]


= Reporting =
You will receive a notice on the 1st of the month detailing which customers went over their allocation: see [[System-generated_Notifications#.22Bandwidth_Overage_Report_for_March_2010.22|Bandwidth Overage Report]]
Any action on that overage is a manual billing task.
Up-to-date bandwidth usage can be seen by visiting the customer's mgmt page. The usage shown is a summary across all of their IPs/servers. To get more information, you can run bandwidth reports by clicking the "view" link in the Bandwidth table on their customer page in mgmt. You can also reach this page from mgmt -> reference -> bandwidth
From there you can pull up details for individual IPs, or by hostname or systemID, and display daily usage or more granular data (15-minute increments). This page also shows the top IPs and customers, updated hourly by a cronjob. You can send the graphs generated on this page to the account owner by clicking "send graph to customer"
Even more detailed info can be retrieved via the customer's account manager -> bandwidth page, which allows searching by protocol or port and pulling raw data. This is the recommended way to get bandwidth data.
= Notices =
See [[System-generated_Notifications#.22Bandwidth_limit_notification.22|Bandwidth limit notification]]
See [[System-generated_Notifications#.22Bandwidth_Overage_Report_for_March_2010.22|Bandwidth Overage Report]]
