
Top 5 Prescriptions for Alert Fatigue


Alert fatigue is a common problem in monitoring systems. It happens whenever an excess of alerts is triggered and most of them are false positives, i.e. noise. Finding a good balance between false negatives and false positives is somewhat of an art rather than a science. In this article I will talk about some techniques to mitigate alert fatigue.

  1. Automatic baselining instead of fixed-threshold alerts
    Automatic baselining is the capability of automatically establishing a baseline for a given metric and alerting on statistical deviations from it. Auto baselines are most useful when data shows periodic variations (e.g. traffic patterns), or when the alert threshold varies per location. There are still cases where you want to keep manual static thresholds, for example for packet loss, route reachability, and availability.
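
    As a rough illustration of the idea (a minimal sketch in Python, not the algorithm ThousandEyes uses), an alerter can keep a rolling window of recent samples and fire only on large statistical deviations from that baseline:

    # Illustrative sketch only (not the ThousandEyes algorithm): flag a sample when it
    # deviates from a rolling baseline by more than k standard deviations, instead of
    # comparing it against a fixed threshold.
    from collections import deque
    from statistics import mean, stdev

    def make_baseline_alerter(window=48, k=3.0):
        history = deque(maxlen=window)            # e.g. the last 48 rounds of a metric

        def check(value):
            alert = False
            if len(history) >= 10:                # wait for enough samples to baseline
                mu, sigma = mean(history), stdev(history)
                alert = sigma > 0 and abs(value - mu) > k * sigma
            history.append(value)
            return alert

        return check

    check_latency = make_baseline_alerter()
    for sample in [80, 82, 79, 81, 80, 83, 78, 80, 82, 81, 79, 240]:
        if check_latency(sample):
            print("latency %d ms deviates from its baseline" % sample)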

  2. Secondary filters
    For a given asset, e.g. a URL, alerts can fire at different locations at the same time. Assuming there are N locations in total, an alert at a single location might not be enough to justify notifying the user. For example, users might only want to be alerted if the alert spans more than one country, or more than one network. In these cases it’s good to have a secondary filter on the number of locations affected, e.g. “Only alert me if more than one location is affected by this alert rule”. Another type of secondary filter is based on time, i.e. “Only alert me if this rule fires more than once in a row”.
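
    A minimal sketch of these two secondary filters (illustrative only, not how ThousandEyes implements them) might look like this:

    # Illustrative sketch of the two secondary filters above: require the alert to span
    # more than one location and to fire in more than one consecutive round before
    # actually notifying anyone.
    def should_notify(rounds, min_locations=2, min_consecutive=2):
        """rounds: list of sets of alerting locations, one set per round, newest last."""
        if not rounds or len(rounds[-1]) < min_locations:
            return False                          # too few locations affected right now
        recent = rounds[-min_consecutive:]
        return len(recent) == min_consecutive and all(recent)   # fired N rounds in a row

    history = [set(), {"Chicago"}, {"Chicago", "London"}, {"Chicago", "London", "Tokyo"}]
    print(should_notify(history))                 # True: 3 locations, 2 rounds in a row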

  3. Reference metrics to filter local problems
    A common nemesis of external monitoring systems is local problems at the agent locations. For instance, if an agent can’t reach its default gateway, packet loss jumps to 100% for all of its targets, and suddenly you have a flood of alert emails in your mailbox. At ThousandEyes we found a simple yet effective solution to this problem. Whenever we add a new public agent to our infrastructure, we schedule a set of tests from N other agents with that agent as the target, so we know when the agent that tests run from is experiencing network problems. In the timeline figure below, the gray bands below the timeline correspond to local network connectivity problems at an agent located in Phoenix, AZ. We can use this information to suppress alerts from a specific location whenever it’s experiencing local problems (a minimal sketch of this suppression logic follows the figure).

    Figure: Timeline with gray bands indicating local connectivity problems at the Phoenix, AZ agent
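
    Here is a minimal sketch of that suppression logic (illustrative only); it assumes you already have, per agent, an aggregate loss figure from the reference tests that target it:

    # Illustrative sketch: drop alerts coming from an agent whose reference tests
    # (tests from other agents toward that agent) show it has local connectivity problems.
    def filter_local_problems(alerts, reference_loss, loss_threshold=50.0):
        """alerts: list of (agent, message); reference_loss: agent -> avg % loss
        measured by the reference tests targeting that agent."""
        kept = []
        for agent, message in alerts:
            if reference_loss.get(agent, 0.0) >= loss_threshold:
                continue                          # the agent itself is unreachable; suppress
            kept.append((agent, message))
        return kept

    alerts = [("Phoenix, AZ", "100% loss to target"), ("Dallas, TX", "30% loss to target")]
    print(filter_local_problems(alerts, {"Phoenix, AZ": 100.0, "Dallas, TX": 0.0}))
    # only the Dallas alert survives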

  4. Reasonable default alert rules
    It’s important to have a default set of alert rules for each test, since that’s what most users are going to work with. Default alert rules are therefore a good opportunity to fine-tune the balance between false negatives and false positives. A good starting point is to select metrics associated with availability rather than speed, e.g. packet loss, DNS resolution, etc., and thresholds that err on the side of minimizing false positives.
  5. Aggregate multiple alerts into a single message
    This is a trick to minimize the number of emails sent for alerts that overlap in time. Let’s say one alert is being triggered for each of N different targets at a given time. Wouldn’t it be better to receive a single email with all these alerts rather than N different messages? If alerts are happening at the same time, there’s also a chance they are correlated, so seeing them in a single message makes sense.
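
    A minimal sketch of this aggregation (illustrative only) is simply a matter of grouping alerts by the round in which they fired before sending:

    # Illustrative sketch: bucket alerts by the round in which they fired and build one
    # digest message per round instead of one email per alert.
    from collections import defaultdict

    def digests(alerts):
        """alerts: list of dicts with 'round' and 'target' keys."""
        by_round = defaultdict(list)
        for alert in alerts:
            by_round[alert["round"]].append(alert["target"])
        for ts, targets in sorted(by_round.items()):
            yield "%s: %d active alerts (%s)" % (ts, len(targets), ", ".join(sorted(targets)))

    alerts = [{"round": "2013-11-20 10:15", "target": "site-a.example.com"},
              {"round": "2013-11-20 10:15", "target": "site-b.example.com"}]
    print("\n".join(digests(alerts)))             # one message covering both alerts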

Along with the points described above, it’s also important for a monitoring system to provide an API from which alerts can be fetched. The API can be used to download raw alerts and apply logic on the user side to filter what is relevant and what is not. In some sense this pushes the complexity of alert processing to the user side for more advanced use cases.
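
As a sketch of that kind of user-side processing (the endpoint, authentication scheme and field names below are placeholders, not the documented ThousandEyes API; check the API reference for the real ones):

# Hypothetical sketch of client-side alert processing; the endpoint, auth header and
# JSON fields are placeholders, not the documented API.
import json
from urllib.request import Request, urlopen

def fetch_interesting_alerts(base_url, token):
    req = Request(base_url + "/alerts", headers={"Authorization": "Bearer " + token})
    with urlopen(req) as resp:
        alerts = json.load(resp)
    # Example of user-side filtering: ignore alerts affecting a single location.
    return [a for a in alerts if len(a.get("locations", [])) > 1]

# interesting = fetch_interesting_alerts("https://api.example.com", "YOUR_TOKEN")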

At ThousandEyes we spend significant effort reducing the number of false positive alerts and alert fatigue in general, while making sure we have sensitive default alert rules to prevent obvious false negatives.



ThousandEyes Product Update: November 2013


Well, it’s already November, and we’re in the final stretch of the year. It’s been a big year for us, including our official launch at Structure 2013, massive team growth, and a ton of really cool product updates. We’ve stayed on schedule with bi-weekly releases throughout the year, and have four remaining this year. In this first periodic ThousandEyes product update post, I’ll lay out some of the key product updates from the last couple of months.

Suffice it to say that it’s been fun, but we’re just getting started… Enjoy!

Virtual Appliances

Deploying private agents has never been so simple. Prepackaged virtual appliances, delivered in industry-standard OVF and Hyper-V formats, allow fast deployment of private agents to nearly any type of target host. Deploy on a PC, Mac, Linux machine or an enterprise-class virtual host using our simple-to-use virtual appliances, which support proxied connections to the Internet and can be pre-packaged according to the needs of you, our customer.

Alerting Functionality

We’ve been through several iterations on our alerting functionality over the past year. From management of highly customizable alert rules and thresholds, to false-positive detection at the agent level, we’ve tuned our alerting infrastructure based on feedback from our customer base to make our alerting as timely and relevant as possible for our end users. Check out our CTO Ricardo’s post on prescriptions for minimizing alert fatigue.


Sharing

Part of dealing with application performance in the context of the network is collaborating with other cross-functional teams, both within and outside of your organization: why send emails and work in separate systems when you can collaborate in the same system, seeing the same data?

To augment our Snapshot sharing capability, we’ve introduced the concept of Live sharing.

Live sharing allows the sharing of a test between two current customers of ThousandEyes, and is purpose-built to allow enterprises and their SaaS providers to share a live view of test results between their organizations. This tremendous capability facilitates collaboration by allowing SaaS providers to gain insight into test results from behind the Enterprise customer’s firewall, without having access to any of the installed infrastructure.

We also released the concept of Saved Events, allowing users to save the results from a block of time around an event indefinitely. Saved events allow users to look at saved event baselines of tests, or specific issues impacting their infrastructure, making preservation and sharing of event data ever simpler.

Bandwidth Estimation

We’ve introduced TCP throughput analysis which, unlike many solutions today, runs without control of both endpoints and without transferring massive files, helping to identify bottlenecked network paths transited from private agents.

For those interested in TCP throughput modeling, our own João Antunes recently wrote a thought-provoking post discussing a simple model for TCP throughput.

MPLS Tunnel Inference

In the last couple of months, we’ve honed our methods for inferring the existence and type of MPLS tunnels, presented using the Deep Path Analysis capability of ThousandEyes. This update provides heretofore-unseen transparency into transit MPLS networks, identifying the labels of transited MPLS devices, which helps in identifying and understanding latency problems obscured inside an MPLS network.

Single Sign-On

ThousandEyes now supports Single Sign-On, using Security Assertion Markup Language (SAML) 2.0. Organizations leveraging SAML authentication can now integrate their third party identity provider into ThousandEyes for authentication into the platform, using a series of easy-to-follow steps, located here (Authentication Required).

API

Our API was no stranger to updates, as customers worked on consuming data collected by ThousandEyes. We’ve introduced API versioning, added JSON support, added write capability, and the ability to run instant tests from private agents programmatically.

Our API has undergone significant changes recently, as we launched APIv3. From our new developer reference site to the underlying changes, we’re continuing to drive ease of use and accessibility of data. Stay tuned to this discussion forum (Authentication Required) for more updates.

Community and Feature Requests

From an overall customer perspective, we’ve also taken a proactive stance for curation of our community, whether in the area of requests from users, or our knowledge base. Our Customer Success Center is our customer portal, providing in-depth articles around the inner workings of our platform, user guides and screencasts, as well as an area for customers to post questions around challenging issues. We strongly encourage all customers to participate!

Have you ever had an idea for something cool that you could do with your data, or noticed something missing from a view in ThousandEyes? We’ve added a feature to the footer of our user interface that lets you submit your requests; just enter your suggestion and our system will notify the product team!


Like this post? You can subscribe to our by-release notifications in our Community Release Updates forum (Authentication Required), or check back here: we’ll be publishing periodic rollups of our feature updates here, on the ThousandEyes blog.


SAML based SSO with ThousandEyes and Okta


SAML (Security Assertion Markup Language) is an XML based standard maintained by OASIS, developed to facilitate the exchange of authentication and authorization data between parties. The primary application of SAML is the secure assertion of an online identity by a user to a Service Provider (SP) with the help of a trusted Identity Provider (IdP), commonly referred to as Single Sign-On (SSO).

Apart from the data format itself, SAML also defines a set of Profiles (use cases of the standard) and Bindings (mappings of how a SAML message is encapsulated inside an existing protocol, for example SOAP).

The most common combination of Profile and Binding currently in use is the Web Browser SSO Profile with the HTTP POST Binding, introduced with version 2.0 of the SAML standard. In this scenario, when a user tries to access a secure web resource provided by the SP, the browser is redirected to the IdP’s website. The IdP can then authenticate the user by asking for credentials or by checking for valid session cookies. After successful authentication the user is redirected back to the SP’s Assertion Consumer Service (ACS), where the SP can make an access control decision based on the asserted identity. If the SP’s policies allow it, the user is finally redirected to the originally requested resource.

The following diagram describes this procedure in more detail.


Figure 1: SAML Web Browser SSO
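
To make the exchange concrete, here is a minimal, illustrative sketch of what an SP’s Assertion Consumer Service does with the posted SAMLResponse form field. A real ACS must also verify the assertion’s XML signature, issuer, audience restriction and validity window before trusting it; that is omitted here.

# Illustrative only: decode a posted SAMLResponse and pull out the asserted NameID.
# Production code must verify the XML signature and assertion conditions first.
import base64
import xml.etree.ElementTree as ET

SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

def extract_name_id(saml_response_b64):
    xml_doc = base64.b64decode(saml_response_b64)     # the form field is base64-encoded XML
    root = ET.fromstring(xml_doc)
    name_id = root.find(".//saml:Assertion/saml:Subject/saml:NameID", SAML_NS)
    return name_id.text if name_id is not None else None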

ThousandEyes supports SAML based SSO

ThousandEyes introduced support for SAML-based SSO back in July. With this support, ThousandEyes allows organizations to integrate third-party identity providers with the platform, leveraging their investment in SAML to improve security and the user experience when authenticating into our platform.

ThousandEyes’ support for SAML requires an Organization Admin to follow an easy configuration guide on both the SAML provider and on ThousandEyes.

ThousandEyes is now part of the Okta Application Network

Today we’re announcing a partnership with Okta, an enterprise-grade identity management service. Being part of Okta’s Application Network means that ThousandEyes’ SAML support has been subject to an extensive technical evaluation, ensuring interoperability with Okta and compliance with the industry’s best practices.

Okta’s customers can now enjoy a simplified, hassle-free setup experience by selecting ThousandEyes from the available applications list in the Okta Administrator Dashboard and completing the setup guide.

You can refer to ThousandEyes’ or Okta’s support documents on this subject to learn more about the integration. And, as always, feel free to reach out to our support team if you have questions or need help.


Reality Check: A Survey of Cloud Providers for Top Sites


At ThousandEyes we deal with huge volumes of data collected from our agents across the Internet. A large number of our tests are configured to target public-facing web sites owned by our customers. The performance of each site depends heavily on the provider hosting the site and how well it is connected to the Internet, especially for sites with a global footprint of users. To understand how top online businesses go about hosting their principal domains, we looked at the 5,000 sites with the most traffic in the United States. At a basic level, the questions we wanted to answer were: (1) how prevalent is self-hosting? (2) what are the most popular cloud providers? (3) how many sites are fronted by DDoS protection services?

Methodology

Each domain in the list maps to an Autonomous System Number (ASN) that represents the organization physically hosting that domain. For example, pinterest.com resolves to the IP address 23.23.131.240. If we look at routing tables for announced address blocks that cover this IP address, we find it falls under address block 23.22.0.0/15, announced by ASN 14618, which belongs to Amazon.

$ whois -h whois.cymru.com " -v 23.23.131.240 "
AS    | IP            | BGP Prefix   | CC | Registry | Allocated  | AS Name
14618 | 23.23.131.240 | 23.22.0.0/15 | US | arin     | 2011-09-19 | AMAZON-AES - Amazon.com, Inc.

For each domain, we followed these steps (a sketch of the first two follows the list):

  1. Resolve the domain to an IP address; if the top level domain can’t be resolved, we try to prepend “www.”
  2. Map the IP address to a BGP (Border Gateway Protocol) prefix using longest-prefix matching on global routing tables; map the respective ASN, as well as the name of the organization
  3. Based on the domain/provider combination, classify the domain into one of three classes:
    1. self-hosted: if the ASN belongs to the same organization that owns the domain
    2. cloud-hosted: if the ASN belongs to a cloud/hosting provider
    3. sec-hosted: if the ASN belongs to a DDoS mitigation service
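
A minimal sketch of the first two steps, using the same Team Cymru whois service shown above (the classification in step 3 still needs a curated list of cloud, hosting and DDoS-mitigation ASNs and is omitted; the whois CLI tool is assumed to be installed):

# Sketch of steps 1 and 2: resolve a domain, then map its IP to an ASN using the
# Team Cymru whois service from the example above. Requires the whois CLI tool.
import socket
import subprocess

def resolve(domain):
    for candidate in (domain, "www." + domain):
        try:
            return socket.gethostbyname(candidate)
        except socket.gaierror:
            continue
    return None

def ip_to_asn(ip):
    out = subprocess.run(["whois", "-h", "whois.cymru.com", " -v " + ip],
                         capture_output=True, text=True).stdout
    last = out.strip().splitlines()[-1]               # skip the header row
    asn, _, prefix, *_, as_name = [field.strip() for field in last.split("|")]
    return asn, prefix, as_name

ip = resolve("pinterest.com")
if ip:
    print(ip, ip_to_asn(ip))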

Self-hosting vs Cloud

From the initial set of 5,000 domains, we weren’t able to map 488 sites (~9%); these unmapped sites had rankings uniformly distributed across the entire 5k population. The graph below shows the breakdown of hosting type for the 4,512 mapped sites.

Type of Hosting for Top US Sites

In the cloud-hosted category, we included IaaS providers like Amazon, more traditional hosting providers like GoDaddy, CDNs like Akamai, and ISPs that also provide hosting, e.g. Qwest.

The self-hosted category typically includes large corporations (large enough to run BGP;)) with multiple address blocks and multiple data centers e.g. bankofamerica.com, apple.com, etc.

The sec-hosted category includes domains that are fronted by a DDoS mitigation service such as Prolexic. We only measured domains where the mitigation service was working at the DNS level, so this number is a lower bound since it does not include cases where the mitigation service works at the BGP level.

Because of our methodology, it’s also possible that some of the domains we classified as cloud-hosted are actually self-hosted, since they can sit in smaller address blocks advertised by the ISP, so our number for self-hosted is also a lower bound.

Hosting Providers

If we just look at cloud-hosted domains, the breakdown of providers is the following:

Top Cloud Providers for Top Sites in US

The usual suspects are leading the charge (Amazon, Rackspace), together hosting more than 20% of the domains [1], but there’s a surprisingly long tail of hosting companies accounting for more than 60% of the domains. In our measurements we only take into account the principal domain, i.e. it’s possible that companies self-host their main site while using Amazon EC2 for dev/test and other use cases.

Conclusion

More than 21% of the top 5k US domains are self-hosted, and that’s biased towards sites with a large geographic footprint. Amazon is the leading cloud provider in the US, but it only hosts 12.4% of the top sites that are cloud-hosted. There’s a surprisingly long tail of small hosting providers that host the vast majority of the sites. In Silicon Valley, many startups kick off with Amazon and Rackspace as their hosting providers, resulting in a perception that Amazon and Rackspace together host a majority of cloud-hosted sites. This perception is not substantiated by our measurements; in fact we see that Amazon only hosts about 9.6% of the top US sites (including self-hosted in the baseline). The web hosting market is still very fragmented, with 61.7% of the cloud-hosted sites using a large number of smaller hosting providers.


1. A related measurement from “Next Stop, the Cloud: Understanding Modern Web Service Deployment in EC2 and Azure” indicated that only 4% of Alexa top 1M sites were using Amazon.


Cooking ThousandEyes Private Agents with Chef


When your organization has multiple branch offices, the IT architecture can get quite complicated. Installing ThousandEyes private agents in each branch location gives you more accurate data about the performance of your network infrastructure, as well as the web-based SaaS applications that depend on it. To get you up and running quickly, we’re supplying modules for the most popular configuration management tools. A module to deploy ThousandEyes agents with Puppet was recently released, and today we’re releasing a teagent Chef cookbook to help you install and configure the ThousandEyes private agent with Chef.

ThousandEyes Private Agent Cookbook Attributes

Below is a list of the default attributes for the teagent cookbook. As a quick but important note: you should always set the value of the ['teagent']['account_token'] key to your account token.

Key                                | Type    | Description                                              | Default
['teagent']['browserbot']          | Boolean | Enable BrowserBot                                        | false
['teagent']['international_langs'] | Boolean | Install the international language support package       | false
['teagent']['account_token']       | String  | Account token for the agent                              | sample value (equals a disabled agent)
['teagent']['log_path']            | String  | Agent log path                                           |
['teagent']['proxy_host']          | String  | Proxy hostname                                           |
['teagent']['proxy_port']          | String  | Proxy port                                               | '0'
['teagent']['ip_version']          | String  | IP version for the agent to run with ('ipv4' or 'ipv6')  | 'ipv4'

Using the Cookbook

This cookbook can be included as a recipe or added to your role list. If you choose to include the teagent cookbook in your node’s run_list don’t forget to set the default attributes.

The simplest way of configuring the ThousandEyes private agent is to accept all the default values. The one parameter that you should always set is the account token:

{
 "teagent": {
     "account_token": "your_account_token_goes_here"
 },
 "run_list": ["recipe[teagent]"]
}

Installing the agent with BrowserBot support (which I highly recommend):

{
 "teagent": {
     "browserbot": true,
     "account_token": "your_account_token_goes_here"
 },
 "run_list": ["recipe[teagent]"]
}

Some extra fonts might be required to properly display the result of a page load test. In that case you would want to install the agent with BrowserBot support and install the international language package:

{
 "teagent": {
     "browserbot": true,
     "international_langs": true,
     "account_token": "your_account_token_goes_here"
 },
 "run_list": ["recipe[teagent]"]
}

You can also set the target log path location:

{
 "teagent": {
     "account_token": "your_account_token_goes_here",
     "log_path": "/var/log"
 },
 "run_list": ["recipe[teagent]"]
}

The agent can work with an HTTP proxy, in which case you will need to set both the host and the port:

{
 "teagent": {
     "account_token": "your_account_token_goes_here",
     "proxy_host": "proxy.example.com",
     "proxy_port": "8080"
 },
 "run_list": ["recipe[teagent]"]
}

By default the ThousandEyes private agent will pick up your default IPv4 address (hence the default attribute values listed above). If you want the ThousandEyes private agent to use the default IPv6 address instead, set ip_version to 'ipv6' as shown in the example below:

{
 "teagent": {
     "account_token": "your_account_token_goes_here",
     "ip_version": "ipv6"
 },
 "run_list": ["recipe[teagent]"]
}

Alternatively, you can include the teagent recipe directly to install and configure the ThousandEyes private agent. The only recipe you need to include is the default one.

include_recipe 'teagent'


Grab the recipes from our Github repo to start cooking ThousandEyes private agents with Chef!


Using ThousandEyes to Analyze a DDoS Attack on Github


On August 15th, 2013, a DDoS attack targeted GitHub.com. The attack was reported by many news organizations; see, for example, “GitHub code repository rocked by ‘very large DDoS’ attack” by Jack Clark of The Register. The GitHub Status Page from that day also provides some timing for this DDoS attack: at 15:47 UTC the status page reports “a major service outage”, and at 15:50 UTC they state: “We are working to mitigate a large DDoS. We will provide an update once we have more information.”

ThousandEyes was monitoring GitHub through public agents before and during the attack, and we captured the coordinated effort by GitHub, Rackspace and their ISPs to fend off the attack. While multiple ISPs were involved in the effort, for the sake of simplicity we will focus primarily on Level 3 Communications and AboveNet. They used different techniques to counter the attack, and ThousandEyes Deep Path Analysis helped us understand how the attack and defense evolved over time.

Figure 1: This shows end-to-end network metrics from ThousandEyes agents reaching Github during the DDoS attack. All agents are reporting 100% packet loss.

The view above shows the end-to-end loss from several of our public agents around the world when testing the GitHub site. At 15:40 UTC the loss was 16.7% worldwide; five minutes later, at 15:45 UTC, the loss had increased to 100% from all locations. This is typically an indication that something is wrong at the network level. To better understand where the problem is occurring, we need to go to the Path Visualization view.

In August, GitHub was hosted by Rackspace, and from our agents we were able to identify four different upstream ISPs connected to Rackspace carrying GitHub traffic. While we will focus on Level 3 Communications and AboveNet, we also monitored provider links from Qwest (CenturyLink). Their configuration has changed significantly since this event in August.

Case #1: Level 3 Communications

Let’s take a step back and look at the state of the network before the DDoS attack on Github, as shown in Figure 2. This is what the Path Visualization looked like earlier in the day on the 15th for traffic using the Level 3 Communications network.

Figure 2: In the period before the attack, the Dallas, Ashburn, Philadelphia, and Raleigh agents all used Level 3 as a provider.

As we move forward to 15:40 UTC, we see nodes with loss, circled in red, in the Path Visualization; specifically, there is loss inside of Level 3 and Rackspace. Red nodes indicate packet loss on the adjacent links facing the destination. Figure 3 below shows, though, that there are still locations without loss (the green nodes on the left).

Figure 3: During the attack the Path Visualization shows loss on some nodes but the destination is still being reached.

Fifteen minutes later at 15:55 UTC none of the traffic is reaching its destination as Figure 4 below shows. All traffic from these agents is terminating inside of Level 3 Communications.

Figure 4: All locations now experiencing 100% packet loss in the Path Visualization.

Case #2: AboveNet

The data from another provider, AboveNet, tells a similar story. At 15:45 UTC we see five agents in the Path Visualization routing through AboveNet to reach GitHub. There is a link inside of AboveNet with an average delay of 124 ms, indicating some stress on the network, and the next three hops (one in AboveNet, two in Rackspace) are experiencing loss; the traffic is not making it to the destination.

Figure 5: Loss and latency in AboveNet and Rackspace during the DDoS attack in the Path Visualization view.

Five minutes later, at 15:50 UTC, Figure 6 below shows that the loss en route to the destination persists, but the location of the loss is now completely different. The traffic is no longer terminating inside of AboveNet or Rackspace. In fact it never makes it into either of their networks; it appears to be terminating at the ingress to AboveNet.

Figure 6: Red nodes on the far right represent traffic termination points, before AboveNet and Rackspace.

To investigate this further we can select one or more of these nodes, move to a period before or after the attack, and see where that next hop is located. Figure 7 below shows that two hours before the attack, all of the nodes now dropping packets had next hops into AboveNet. This suggests that some sort of destination-based filtering is happening in AboveNet for GitHub traffic.

Figure 7: This image shows the Path Visualization before the DDoS attack on Github inside AboveNet, and the highlighted blue nodes are the AboveNet edge that during the attack is dropping traffic destined for Github.

Figure 8: All nodes inside of Level 3 now experiencing 100% packet loss during the attack.

Conclusion

It is difficult to say exactly what techniques were used to mitigate the attack without access to internal ACLs or routing tables (or even iBGP feeds), or input from the teams involved in stopping this attack on GitHub. However, from the public statements on the GitHub Status Page and from our own data, it is clear there was a coordinated and cooperative effort to stop the DDoS attack and place destination-based filters in routers inside GitHub’s providers. These filters are often distributed through iBGP inside the ISP using a mechanism such as Remotely Triggered Black Hole (RTBH) filtering. For additional information about this technique, see “Remotely Triggered Black Hole Filtering-Destination Based and Source Based”, Cisco Systems, Inc., 2005 (PDF).

Regardless of the specific techniques used by site owners, their hosting companies or their ISPs to combat DDoS, these efforts require some level of coordination and cooperation. ThousandEyes can help in understanding how effective the filters are in mitigating these attacks, how DDoS attacks evolve over time and across the network, and, in some instances, even in identifying the source of the attacks.

Additional Resources:

“ISP Security – Real World Techniques” by Gemberling, Morrow, and Greene (PDF)
“Remotely Triggered Black Hole Filtering-Destination Based and Source Based”, Cisco Systems, Inc., 2005 (PDF)
“Remotely-Triggered Black Hole (RTBH) Routing” by Jeremy Stretch, 2009


Measuring Network Bandwidth Without Server Instrumentation


Measuring network bandwidth is key to understanding network performance and performing capacity planning. Iperf, a traditional network testing tool, measures TCP throughput, but this is not the same as available bandwidth (or capacity, for that matter). For instance, iperf results are highly dependent on the default TCP parameters of the kernel, among other caveats. But one of the main limitations of iperf is that it requires server instrumentation. Wouldn’t it be great to have a tool that measures available bandwidth by instrumenting only the client?

Well, that’s precisely what we developed here at ThousandEyes. Our bandwidth estimation technology uses specially crafted and sequenced TCP packet trains that are sent between our agent (the client) and a specified server. Using this technique we’re able to periodically measure the end-to-end available bandwidth between your branch office and salesforce.com, for example. This only requires an open TCP port on the server side. Here are two additional use cases for bandwidth estimation that we commonly see among our customers.

Use Case #1: Provisioning Network Capacity

The example below (figure 1) shows the capacity (dark blue) and the available bandwidth (light blue) time series on a network path between one of our agents in Mexico and a Wikipedia server located in California (www.wikipedia.org). Looking at the timeline, one can immediately see a daily pattern where there’s a decrease in available bandwidth starting at 8am Pacific, and an increase approximately 12 hours later at 8pm Pacific. This roughly aligns with the window when businesses are open in California. One can also see a local maximum of available bandwidth around 12pm which coincides with lunch time. This information is critical to understanding end-to-end network utilization and determining when to update the capacity of links inside your WAN.


Figure 1: Wikipedia HTTP Test with bandwidth measurement enabled

Use Case #2: Troubleshooting Poor Application Performance

The example below (Figure 2a) shows an HTTP server test to www.espn.go.com that fetches the index page of the site periodically, as well as the time series of available bandwidth during the same time window (Figure 2b). Note the correlation between the almost 20x increase in fetch time and the drastic decrease in available bandwidth.


Figure 2a: HTTP server test to www.espn.go.com with a spike in fetch time



Figure 2b: Available bandwidth is greatly decreased during the same time period

In addition to TCP-based tests, ThousandEyes also supports ICMP-based available bandwidth measurement to a specified target. This is typically used when targeting network devices, such as routers and switches, that don’t have open TCP ports. The tests above measure bandwidth from the client to the server. If you are interested in measuring bandwidth from the server to the client, you can either place an agent at the server side and target the client, or alternatively set up an HTTP server test that fetches an object from the server (Figure 3) in order to approximate available bandwidth by looking at TCP throughput (this is the curl/wget equivalent). Make sure you use a large enough object if you do this.

Figure 3: HTTP Test measuring available bandwidth of an object from a server
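
As a rough, illustrative equivalent of the curl/wget approach mentioned above (it measures TCP throughput, which only approximates available bandwidth in the server-to-client direction; the URL below is a placeholder):

# Rough curl/wget equivalent: download a sufficiently large object and divide the
# bytes transferred by the elapsed time to approximate server-to-client throughput.
import time
from urllib.request import urlopen

def tcp_throughput_mbps(url, chunk=64 * 1024):
    start = time.monotonic()
    total = 0
    with urlopen(url) as resp:
        while True:
            data = resp.read(chunk)
            if not data:
                break
            total += len(data)
    elapsed = time.monotonic() - start
    return (total * 8) / (elapsed * 1_000_000)        # megabits per second

# print(tcp_throughput_mbps("https://example.com/large-object.bin"))  # placeholder URL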

At ThousandEyes, we’ve spent considerable effort in developing a bandwidth estimation technique that works in black-box scenarios, such as when enterprise users access third-party SaaS applications. This extra visibility is important to understanding performance bottlenecks when using SaaS applications, especially when overlaid with our Path Visualization capability.

To see how bandwidth affects your application performance, without the hassles or constraints of server instrumentation, start your free trial of ThousandEyes today.


451 Research: IT Gets a New Tool to Solve SaaS Performance Challenges


Last month 451 Research, an independent analyst company focused on the business of enterprise IT innovation, published an Impact Report by Peter Christy and Dennis Callaghan. The report addresses the new challenges IT teams face as SaaS applications are being adopted in their organizations. They found ThousandEyes to be “an innovative and differentiated application to address the new problems” of SaaS performance.

From “ThousandEyes helps SaaS providers and customers see network problems”:

When a company changes from an in-house application to a SaaS equivalent, the transition may be nearly invisible to the application user, but dramatic in terms of application performance visibility for the IT team. The performance of the application proper is no longer directly visible because it is operated remotely by a different organization, and frequently neither is the performance of the connecting network as it evolves from an enterprise WAN connection to some combination of ISP and Internet connectivity.

These new network configurations require new tools in order to restore visibility to issues in network performance (a necessary precursor if poor application performance is to be remediated). The ThousandEyes application fills that gap, providing innovative methods for measuring network performance and localizing problems that occur on networks owned and operated by other organizations, such as the network service provider of the enterprise or of the SaaS application operator.

ThousandEyes has created a modern cloud-delivered application to help identify and remediate network issues that manifest themselves as performance problems with the use of SaaS applications, or (often equally important) demonstrate that the performance issues are not caused by the network. The company is one of the first vendors to address the new challenges that arise as enterprise application use evolves to SaaS applications.

ThousandEyes has incorporated performance-measurement software and services in its eponymous SaaS offering. It enables the network team of an enterprise using SaaS applications (such as salesforce.com), or a SaaS vendor, to view and analyze the performance of the network connections involved when enterprise users access SaaS applications; identify network-caused performance-related issues; and collaborate with partners, including network service providers, to remedy the problems.

ThousandEyes includes features that enable collaboration between the different teams that may be involved in the diagnosis and remediation of the network problems that ThousandEyes identifies (e.g., an enterprise network team, the enterprise’s network service provider, the SaaS team). While working within the tool, a network admin can easily enable a partner to use suitable functions in the application by sending the partner a URL – the application is SaaS and used through Web interfaces – that, when referenced, puts the partner into the application with the context relevant to the problem being addressed collaboratively, similar to how a Dropbox user can easily share a URL that enables another person to access a specific file or folder whether or not the other person is a Dropbox user.

For the full, in-depth report from 451 Research, including a contrast of ThousandEyes with legacy performance tools and the SWOT analysis, download the Impact Report on ThousandEyes directly.



Visualizing Cloud Based DDoS Mitigation


DDoS attacks come in many shapes and sizes, making mitigation difficult and complex. The general gist is to overwhelm the infrastructure of the target — whether routers, bandwidth, DNS or application servers. Importantly, DDoS attacks can target both application and network layers.
DDoS Attacks by Type
The most common categories of DDoS attacks are:

  1. Volumetric: Flooding the bandwidth and routers of an enterprise’s network. Examples include UDP floods, ICMP floods, DNS reflection, and NTP reflection attacks.
  2. IP Fragmentation: Overwhelming network infrastructure by consuming and overloading memory as a server recombines non-initial packet fragments. Examples include TCP, UDP, and ICMP fragment attacks.
  3. TCP Connection: Overwhelming a load balancer or application server by spawning lots of connections and holding them open, preventing legitimate traffic from creating connections. Examples include SYN flood attacks.
  4. Application: Attacks that target and overwhelm connections to the application server with malformed requests and slow responses. Examples include HTTP GET and POST attacks.

Three Approaches to DDoS Mitigation

In order to thwart these varied and often concurrently used types of attacks, organizations use a combination of three mitigation strategies:

  1. On-Premises: A variety of tactics can be employed, such as source or destination filtering using ACLs, remote-triggered black holes and intrusion prevention systems in order to reduce the volume of traffic continuing through the network. These may be implemented using a dedicated appliance or through load balancers placed at the network edge.
  2. ISP Collaboration: Using similar tactics as an on-premises appliance, enterprises work with their ISPs to filter or black hole traffic before it even reaches their network.
  3. Cloud-based Mitigation: During an attack, an enterprise will reroute traffic using DNS or BGP to a third party mitigation vendor who will use scrubbing centers to filter the traffic. The mitigation vendor will then pass along legitimate traffic to the enterprise network.

DDoS attacks typically manifest themselves to users as an unavailable service due to congested bandwidth, overloaded routers and overloaded application servers. With ThousandEyes it becomes clear where in the network this congestion is happening during the course of an attack, and which infrastructure is being overwhelmed.

Network Topology of an Attack on a Global Bank

Let’s look at an example of a real, volumetric DDoS attack against a U.S. bank, a type of attack that happens frequently. Prior to the attack, at approximately 3pm Eastern, the application availability looks fine as measured by ThousandEyes endpoints around the globe (Figure 1).

Figure 1: Before the attack, all agents showing zero errors and full availability

As the attack begins around 3:30pm, nearly half the endpoints around the world report connection failures to the banking site (Figure 2).

Figure 2: Agents begin turning red, signifying connection failures

The application availability of less than 50% is due to network congestion (and potentially some filtering) that is causing widespread packet loss from nearly all of the endpoints (Figure 3). Due to the levels of packet loss, most of these endpoints simply cannot access the banking service.

Figure 3: ThousandEyes agents now showing packet loss to all global endpoints

Monitoring Mitigation Techniques During an Attack

At this point the bank’s DDoS mitigation measures kick into gear. As the attack is underway the bank uses a cloud-based mitigation service to redirect and filter, or ‘scrub’, the traffic so that it will no longer overwhelm their network. Despite the cloud-based mitigation, there is still significant packet loss during the attack both within the mitigation provider and the bank’s own network. In Figure 4, we can see that at this point during the attack there are three scrubbing centers filtering North American, Asian and European traffic before it gets to the bank’s network. However, one of the three scrubbing centers (Scrubbing Center 1) appears to be struggling with the traffic volume, or is excessively filtering traffic.


Figure 4: Path visualization during DDoS mitigation. Nodes in the DDoS mitigation vendor’s network are highlighted in yellow. See three points indicating scrubbing centers.

Despite the use of cloud-based DDoS mitigation, and potentially other mitigation strategies, application availability does not stabilize for nearly 12 hours. Monitoring the efficacy of DDoS mitigation and fine-tuning strategies throughout an attack is crucial to ensure that application and website users see minimal impact. In this situation, having insight into the performance of external providers such as ISPs and mitigation vendors can be invaluable in keeping service levels high.

Find out more about monitoring and analyzing DDoS attacks using ThousandEyes with a downloadable PDF, ThousandEyes for DDoS Attack Analysis. Read on to Part 2, where we discuss monitoring BGP routing changes during a DDoS attack.


Press Release: ThousandEyes Named a ‘Vendor to Watch’


Recognized for Distinctive Capabilities to Troubleshoot Complex Cloud Performance Issues

ThousandEyes, which provides IT performance management for the cloud era, today announced that it has been awarded the coveted Vendor to Watch award from Enterprise Management Associates (EMA), a leading industry analyst firm. EMA Vendors to Watch are companies that provide unique customer value by solving problems that have previously gone unaddressed.

“ThousandEyes is a new and distinctive take on managing applications from the network layer,” said Julie Craig, research director at EMA. “A unique combination of features positions the product in the forefront of hybrid Cloud management solutions and differentiates it from traditional ‘performance management’ solutions. The impressive list of well-known customers already on board drives this point home in a big way. A combination of DNS, Web, Network, and BGP testing, and particularly its ability to use BGP to reveal physical routes, is likely one reason why it already counts multiple eCommerce, SaaS, and customer-facing enterprise vendors as customers.”

ThousandEyes provides detailed visibility into the performance of cloud applications and helps IT teams resolve performance problems quickly. Customers include members of the Fortune 500, 7 of the top 10 SaaS providers and others. To see what customers are saying about ThousandEyes, go to: http://www.thousandeyes.com/customers

“We are honored to be named a Vendor to Watch by EMA and are excited to reinvent network performance management for Cloud applications,” said Mohit Lad, co-founder and CEO of ThousandEyes. “We are also pleased that EMA recognizes the need for innovative Cloud management techniques and products and sees ThousandEyes as a distinctive solution in this space.”

To download the report, simply click the link: “EMA Vendor to Watch: ThousandEyes Determines ‘Who to Call’ for Complex Cloud Performance Issues”.

About ThousandEyes

ThousandEyes provides IT performance management for the cloud era. The company’s solution provides detailed visibility beyond the corporate network perimeter, identifies the root cause of performance problems with cloud applications and enables distributed collaboration to resolve problems quickly. ThousandEyes is backed by Sequoia Capital and headquartered in San Francisco, CA. For more information, visit www.thousandeyes.com or follow on Twitter at @ThousandEyes.

About EMA

Founded in 1996, Enterprise Management Associates (EMA) is a leading industry analyst firm that provides deep insight across the full spectrum of IT and data management technologies. EMA analysts leverage a unique combination of practical experience, insight into industry best practices and in-depth knowledge of current and planned vendor solutions to help its clients achieve their goals. Learn more about EMA research, analysis and consulting services for enterprise line of business users, IT professionals and IT vendors at www.enterprisemanagement.com. You can also follow EMA on Twitter or Facebook.

Media and Analyst Contact

Amber Rowland

amber@therowlandagency.com

+1-650-814-4560


Using BGP to Reroute Traffic during a DDoS


Continuing our discussion of visualizing DDoS attacks from last week, today we are going to look at an attack against a multinational bank. Whereas last week’s example focused on path visualization, this week’s touches on how the Border Gateway Protocol (BGP) plays a role in rerouting traffic during an attack.

A quick aside on BGP: BGP is an Internet routing protocol that advertises which Autonomous Systems (ASes), the large networks that make up the Internet, are reachable from other networks. In this way routers know where to forward packets in order to reach a destination network. Links between networks are ever changing due to hardware failures, downed links, and changes in peering between networks. BGP can also be used during a DDoS attack to redirect traffic to scrubbing centers that filter out malicious traffic, particularly centers operated by cloud-based mitigation vendors.

Let’s join a DDoS attack in progress, with widespread service degradation and packet loss clearly visible in our network metrics (Figure 1).


Figure 1: Bank website experiencing packet loss from locations around the world.

In response to the DDoS attack, the bank begins rerouting traffic from their own network to that of their cloud-based DDoS mitigation vendor. This is evident from the BGP path changes that are being advertised, switching from the bank’s Autonomous System to that of its mitigation provider in order to begin scrubbing of traffic. In Figure 2, we see BGP path changes propagate, as the previous route to the bank (the white circle) via their ISP, Verizon Business (AS 701), is changed over to new routes to their mitigation vendor (the green circle).


Figure 2: Bank uses BGP to reroute traffic from their own Autonomous System (AS) to that of their DDoS mitigation provider.

This change routes traffic through several global scrubbing centers, as visible in the Path Visualization view. In Figure 3, we can see these scrubbing centers located in Europe and the US, each handling traffic from different regions around the world, listed on the left. The bank’s website is the green circle on the far right.


Figure 3: During mitigation traffic is routed through scrubbing centers, each serving geographic regions.

Within minutes the effect on application performance is clear, with packet loss dropping dramatically and availability improving to 100% (Figure 4). The DDoS mitigation vendor continues to filter traffic in order to stave off the attack.


Figure 4: After mitigation is underway, packet loss returns to normal.

After the attack has subsided almost 24 hours later, the bank uses BGP to advertise new routes to its network and to no longer use the networks of its DDoS mitigation provider. In Figure 5, we see new routes to the bank’s network (in green) via two upstream ISPs (in gray) as well as the old routes that used to direct to the mitigation vendor (in white).


Figure 5: Once the attack is over, the bank changes BGP paths back to their own network from that of their DDoS mitigation provider.

Network Visualization of DDoS Attacks

This example shows a relatively successful response to a major DDoS attack. In both this example of a successful mitigation and the previous example, where mitigation had more mixed results, the importance of network visualization during a DDoS attack is clear: it helps teams communicate effectively with network operations and the various vendors involved in the response.

Visibility into an ongoing DDoS attack is critical given how many moving pieces there are. Networks are overloaded and under stress. New DNS records and BGP routes are being advertised to reroute traffic for filtering. Access control lists are being updated to filter out traffic. And the attackers are evolving their attack vectors continuously. During a DDoS attack you’ll want a toolset that can monitor global availability and real-time performance, ensure DDoS mitigation is being deployed correctly, and provide continuous insight into mitigation efficacy.

Find out more about monitoring and analyzing DDoS attacks using ThousandEyes with a downloadable PDF ThousandEyes for DDoS Attack Analysis.


Identity Management for the Cloud Era – Part 1


Many organizations struggle when it comes to properly managing user access to applications, systems, and services. It was already challenging to provision, de-provision, and manage access to all internal applications; when cloud applications came into play, these tasks became resource-consuming and the associated risks increased. Forrester Research states that the average help desk labor cost for a single password reset is about $10 to $15, while improper access removal can be grave for the company or can result in significant fees for non-compliance with, for example, PCI DSS, the Sarbanes-Oxley Act, HIPAA, FCC regulations and many other industry and regulatory standards. In many companies today employees have to remember more than two passwords. As a result, 81% of companies cite complex password policies as the single biggest user complaint about access [1].


Figure 1: Single user – multiple user accounts

Figure 1 illustrates a user accessing multiple applications with multiple accounts. Each account needs to be properly managed: created, deleted, passwords need to be securely reset.

Identity Centralization and Federation

Companies are looking to reduce the number of passwords that employees must remember by utilizing technologies such as the Microsoft Active Directory identity management solution [2] based on SAML, other directories, and federated identity applications and protocols. Technologies supporting this direction are quite mature, with first implementations dating back to 2002 [3]; however, adoption is somewhat behind, and not many Software-as-a-Service (SaaS) providers support them today. For example, Amazon only recently announced support for SAML [4], and many others still fall behind and do not offer such functionality.

ThousandEyes recognizes the importance of enterprise identity management and has supported SAML natively since its 2013-07-10 release [7]. Figure 2 demonstrates a user accessing multiple applications with the same account, based on federated identity management and a centralized directory.


Figure 2: Single user – single account

The main benefits of such a scenario are decreased costs for password management, improved security (as long as SAML and the underlying XML infrastructure itself are secure) and a lower risk of non-compliance with internal security standards or external laws and regulations.

ThousandEyes with Microsoft Active Directory Federation Services

Previously we explained how to configure ThousandEyes for federated Web SSO using Okta as the identity provider [6, 10]. Now we’ll review the configuration for Microsoft Active Directory Federation Services (AD FS) 2.0.

AD FS is a Microsoft implementation of federated identity services based on the SAML protocol [8, 9]. AD FS was first released with Windows Server 2003 R2, and all subsequent releases of Windows Server also include this functionality.

How SAML works is described in a number of publications, including “SAML Single Sign-On (SSO) Service for Google Apps” by Google [5]. For the purpose of this document I have created a virtual lab with one domain controller (addc.les.local), one AD FS server (adfs.les.local) running Windows Server 2012 R2, and one Windows 8.1 client computer.

In my lab, the internal Web SSO service was assigned the address https://sso.les.local, where the user is presented with a default AD FS web page (a similar page might be hosted in a perimeter network for access over the Internet using AD FS Proxy). Figure 3 below shows the page after implementing the configuration specific to ThousandEyes, which we’ll review in detail in Part 2 of this post.


Figure 3: Federated Web SSO page

We will focus on implementing the following scenario: a user from Les.Local logs into their Web SSO at https://sso.les.local (Les Local SSO) using integrated Windows authentication and clicks to log in to ThousandEyes. AD FS returns a SAML assertion to the user’s browser. The browser redirects to https://app.thousandeyes.com and submits the SAML assertion. The user is logged into ThousandEyes.

Please keep in mind that user creation via SAML is not currently supported; all user accounts must already be registered within the ThousandEyes application. By implementing this scenario we are replacing local ThousandEyes application authentication with corporate Active Directory-based authentication. We’ll be publishing detailed configuration instructions to the blog this week, but in the meantime, get started preparing your ThousandEyes application and SSO service from the ThousandEyes Customer Success Center.


References

1. “Enhancing Authentication to Secure Open Enterprise” (PDF), December 2010, Forrester Research, Inc.

2. “Guide to the Sarbanes-Oxley Act: Managing Application Risks and Controls” (PDF), 2006, Protiviti, Inc.

3. Geyer, Carol. “History of SAML.” 2007, OASIS.

4. Terman, Elias. “Amazon Adoption of SAML a Big Step Forward for Authentication Protocol.” 2007, OASIS.

5. “SAML Single Sign-On (SSO) Service for Google Apps.” 2013, Google.

6. Fraleigh, Dave. “How to configure Single Sign-On with Okta.” July 2013, ThousandEyes.

7. Fraleigh, Dave. “Release update 2013-07-10.” July 2013, ThousandEyes.

8. Pierson, Nick. “Windows Server 2012 AD FS Deployment Guide.” February 2012, Microsoft.

9. “Active Directory Federation Services.” 2012, Microsoft.

10. Rodrigues, Nelson. “SAML based SSO with ThousandEyes and Okta.” November 2013, ThousandEyes.


Identity Management for the Cloud Era – Part II: AD FS Configuration


Picking up from Part 1: ThousandEyes with Microsoft Active Directory Federation Services, you should already have both AD FS and your ThousandEyes application running and accessible. I am going to use this link to verify that my installation works: https://sso.les.local/adfs/fs/federationserverservice.asmx. It will return an XML page if everything is OK.

If you are doing a new AD FS installation, here are a couple of things to remember. With the AD FS version in Windows Server 2012 R2 there is no need to install Internet Information Services; it is no longer required. The SSO service name must be different from the server name; in my case they are sso.les.local and adfs.les.local. Due to Kerberos and service principal name registration requirements, make sure DNS resolution works (create an A record for sso.les.local).

When AD FS is installed, it automatically selects its Active Directory as the identity provider; this is where all user accounts will be stored. You can verify this using AD FS Management: select "Trust Relationships", and under "Claims Provider Trusts" you will see "Active Directory" listed as Enabled.

The next step is to configure ThousandEyes as a service provider using AD FS Management, under "Trust Relationships" – "Relying Party Trusts".

1. Open the AD FS management console and click "Certificates" under "Service" under "AD FS". Double-click the Token-signing certificate to open it and click the Details tab of the certificate, as depicted in the figure below.

Token signing certificate export

Figure 1: Token signing certificate export

2. Click "Copy to File"; this will launch the "Certificate Export Wizard". Save the certificate as a DER-encoded file (.CER), for example, "C:\Temp\Token-signing-certificate.cer".

3. Log into the ThousandEyes application using your Organization Admin account and navigate to "Settings" – "Account" – "Security & Authentication"; select the "Enable Single Sign-On" check box.

ThousandEyes SSO configuration

Figure 2: ThousandEyes SSO configuration

4. Make the entries shown in Figure 2 above, replacing "sso.les.local" with your own Web SSO service name.

5. Click Save; this will save your SSO configuration.

Now we’ll finish AD FS configuration.

6. In AD FS management console (sso.les.local) under “Trust Relationships” right click “Relying Party Trusts” and select “Add Relying Party Trust…”. This will launch “Add Relying Party Trust Wizard”. Click Start on the welcome page.

7. On the next page select "Enter data about the relying party manually", as depicted below, and click Next.

Add Relying Party Trust Wizard

Figure 3: Add Relying Party Trust Wizard

8. Enter "ThousandEyes" as the Display name and hit Next. This is the application name you will see in the combo box on the SSO login page.

9. Leave AD FS Profile selected to activate SAML 2.0 and hit Next.

10. Do not make any changes on the next page (certificate selection) and hit Next.

11. On the next page, select the "Enable support for the SAML 2.0 WebSSO Protocol" checkbox and enter the following string as the service URL: https://app.thousandeyes.com/login/sso/acs. Hit Next.

Relying party SAML 2.0 SSO Service URL

Figure 4: Relying party SAML 2.0 SSO Service URL

12. Enter the "Relying party trust identifier" exactly as it was entered in step 4 ("Service Provider Issuer" in the ThousandEyes application): http://www.thousandeyes.com. Hit Add, then Next.

Relying party trust identifier

Figure 5: Relying party trust identifier

13. The next step offers to configure multi-factor authentication. For the purpose of this lab scenario I did not configure strong authentication; select "I do not want to configure multi-factor authentication for this relying party trust at this time." and hit Next.

14. The next page offers to create authorization rules. For the purpose of this lab scenario I did not configure authorization rules; select "Permit all users to access this relying party." and hit Next.

15. Review the information on the summary page and hit Next, then Close.

16. The "Edit Claim Rules" dialog opens. Now you need to define how to map your internal Active Directory users to ThousandEyes users. ThousandEyes expects an attribute called "Name ID", and it must be equal to the email address registered in ThousandEyes. For the purpose of this lab scenario I am mapping the "User Principal Name" (UPN) to "Name ID". In a real-world scenario this will probably be an email address.

17. Click Add Rule, select "Send LDAP Attributes as Claims", enter the Claim rule name "UPN as NameID", select Active Directory as the Attribute store, on the left side of the table select "User-Principal-Name", on the right side select "Name ID", and hit Finish.

Claim rules

Figure 6: Claim rules

This step completes the configuration. Now it is time to test.

I have created a test user in both Active Directory and ThousandEyes; the UPN of the Active Directory user equals the email address of the ThousandEyes user. On my Windows 8.1 workstation I log in as this user, open a web browser and navigate to the Les Local SSO page at https://sso.les.local/adfs/ls/idpinitiatedsignon.htm. Select ThousandEyes from the list of service provider applications and hit "Sign in"; this signs you into the ThousandEyes application using your local Active Directory credentials and SAML in a federated Web SSO scenario.

If everything works as expected, you can log into ThousandEyes and force the use of SSO; this prevents users from using local ThousandEyes accounts and enforces federated identity. Please note that Organization Admins will still be able to log in using their local ThousandEyes accounts.

Force SSO

Figure 7: Force SSO

Troubleshooting

To troubleshoot issues you can enable AD FS tracing in the Windows Server Event Viewer. It is also useful to know the URL https://app.thousandeyes.com/login?breakSso, which bypasses Web SSO and returns you to local ThousandEyes accounts. Feel free to ask me any questions in the comments field below or get help in the ThousandEyes Customer Success Center.

The post Identity Management for the Cloud Era – Part II: AD FS Configuration appeared first on ThousandEyes Blog.

Monitoring BGP Routes with ThousandEyes


In this post I'll cover in more detail the route monitoring capabilities of ThousandEyes, already touched on in previous blog posts about how routing changes impact performance and BGP for DDoS mitigation. Routing is a key determinant of network performance; each route that packets take has varying latency and throughput. And when routing goes wrong, it can prevent packets from reaching their destination. Therefore, understanding routing across networks, specifically using Border Gateway Protocol (BGP), is critical to troubleshooting traffic flows that traverse large corporate networks or the public Internet.

A Brief Intro to the Border Gateway Protocol (BGP)

The Internet consists of a myriad of independent networks organized into Autonomous Systems (ASes). Each AS typically represents an independent administrative domain managed by a single organization and identified by a 4-byte number, e.g. AS 7018 is AT&T, AS 701-703 are Verizon, etc. Inside each AS there is a set of border routers (e.g. 2a-2c in Figure 1) that typically connect to each other in a full mesh using iBGP (i = internal; route reflectors and confederations can be used to relax this constraint). Border routers in different ASes connect to each other through eBGP (e = external) sessions. BGP is used to announce reachability to a chunk of IP addresses (a prefix). BGP defines more than just physical interconnections; it is used to advertise which routes are possible based on policies driven by other considerations such as traffic engineering, maintenance, and commercial transit and peering agreements.

For example, pinterest.com resolves to the IP address 23.23.131.240. If we look at routing tables for announced address blocks that cover this IP address, we find it falls under address block 23.22.0.0/15 announced by AS 14618 belonging to Amazon.

$whois -h whois.cymru.com " -v 23.23.131.240 "
AS 	| IP 	| BGP Prefix 	| CC | Registry | Allocated | AS Name 
14618 | 23.23.131.240	| 23.22.0.0/15 	| US | arin 	| 2011-09-19 | AMAZON-AES - Amazon.com, Inc.

Looking at the RouteViews route server (telnet://route-views.routeviews.org), we can check the different AS paths available to reach 23.22.0.0/15. In the case below there are 31 available routes to the destination, but the router only picks one, the BGP best path, which is selected after evaluating several route attributes, including BGP Local Preference and AS path length.

route-views> sh ip bgp 23.23.131.240
BGP routing table entry for 23.22.0.0/15, version 636446191
Paths: (31 available, best #8, table Default-IP-Routing-Table)
  Not advertised to any peer
  3277 39710 9002 16509 14618
	194.85.102.33 from 194.85.102.33 (194.85.4.4)
  	Origin IGP, localpref 100, valid, external
  	Community: 3277:39710 9002:9002 9002:64789
  852 16509 14618
	154.11.98.225 from 154.11.98.225 (154.11.98.225)
  	Origin IGP, metric 0, localpref 100, valid, external
  	Community: 852:180
  3356 16509 14618
	4.69.184.193 from 4.69.184.193 (4.69.184.193)
  	Origin IGP, metric 0, localpref 100, valid, external
  	Community: 3356:3 3356:22 3356:100 3356:123 3356:575 3356:2006 65000:0 65000:7843
	...

eBGP sessions glue different ASes together, and iBGP sessions connect routers within the same AS.

Figure 1: eBGP sessions glue different ASes together, and iBGP sessions connect routers within the same AS.

External BGP Visibility (outside-in)

Public sources of BGP data, including RIPE-RIS in Europe and RouteViews in the U.S., establish eBGP sessions with hundreds of routers across the world (monitors) and provide a comprehensive picture of global routing reachability for a given prefix (outside-in). This is the picture ThousandEyes typically represents in our BGP Route Visualization. For instance, in Figure 2, AS 36175 (ancestry.com) is announcing prefix 66.43.20.0/22 to 2 upstream providers, XO Communications (AS 2828) and American Fiber (AS 31993). Each of the small green circles represents a router (or monitor) that is providing public BGP feeds. In the timeline, we are showing the average number of path changes per monitor; other metrics such as reachability and number of updates are also available. In this case, we noticed a bump in the number of path changes at 6:00 UTC. If we zoom into that instant in time (Figure 3), we can see that there was a route change from AS 2828 (XO) to AS 31993 (American Fiber).

Ancestry.com (AS 36175) route visualization.

Figure 2: Ancestry.com (AS 36175) route visualization.

A routing change between different providers caused packet loss across the network for AS 36175.

Figure 3: A routing change between different providers caused packet loss across the network for AS 36175.

Internal BGP Visibility (inside-out)

We recently released the capability to visualize both public and private eBGP routes for our customers. This means that any of our customers can set up a multi-hop eBGP session between one of their BGP speakers and our route collectors. There are two main benefits:

1. Internal prefixes: for prefixes originated inside the network, the private feed is useful to triage problems whose root cause is inside the network versus problems that originate outside; users will be provided with a single view of public and private feeds.
2. External prefixes: for prefixes belonging to a third party (e.g. a SaaS provider), the private feed is useful to detect cases where the route to the destination is sub-optimal (which affects performance of the application), or the route is taking a detour to a malicious destination (route hijacking).

Figure 4 below shows an example of one of our customers’ internal prefixes as seen by a combination of public and private BGP monitors. The small green double circle is a private BGP monitor. We can see that there are two origin Autonomous Systems in this view (the big green circles in the middle), but private AS 64999 in this case is only seen by the private monitor, and it’s not exposed to the other monitors.

Monitoring iBGP routes

Figure 4: Integrating private and public BGP feeds into a single view.

Setting Up Private BGP Feeds in ThousandEyes

Setting up a private BGP feed with us is straightforward. Just go to "Settings -> My Domains & Networks -> Private BGP Monitors" and complete the form indicating your router IP address and ASN, and we will coordinate with you to bring the session up (Figure 5). You can also check the status of your sessions in the table at the bottom of that page.

ThousandEyes BGP tests

Figure 5: Configuring a new private BGP session with ThousandEyes.

With the combination of both public and private eBGP visibility, ThousandEyes provides a greater understanding of routing issues that occur within a corporate network as well as issues with external prefixes. This information can help reduce latencies, spot inefficient routes, troubleshoot incorrect routing changes, and detect hijacked routes. Start monitoring BGP routes in ThousandEyes by signing up for a free trial.

The post Monitoring BGP Routes with ThousandEyes appeared first on ThousandEyes Blog.

Troubleshooting Path MTU and TCP MSS Problems


In this article I’ll do a deep dive into some of the not-so-obvious capabilities of Deep Path Analysis, focusing on troubleshooting Path MTU and TCP MSS measurements. These two parameters are important in determining how application streams are split into different packets so that network interfaces across a path can forward them, and end-hosts are able to piece them together, both at the network (MTU) and transport (MSS) layers.

Maximum Transmission Unit and Path MTU

The Maximum Transmission Unit (MTU) of a network interface is the maximum packet size (in bytes) that the interface is able to forward. For example, for Ethernet v2 the MTU is 1500 bytes, but if Jumbo Frames are used the MTU can go up to 9000 bytes. A larger MTU has benefits, since fewer packets are needed to transfer the same amount of data and per-packet processing is minimized (for example, transferring 1 MB of payload takes roughly 685 packets with a 1460-byte MSS, versus roughly 112 packets with jumbo frames). However, since packets are now bigger, per-flow delays between packets are also higher, so the minimum delay increases, which is not good for real-time applications such as voice.

The Path MTU (PMTU) between two end-hosts is the minimum MTU of all the interfaces used to forward packets between them. There's a standard method called Path MTU Discovery (PMTUD) that end-hosts use to determine the PMTU of a connection. Both the IPv4 and IPv6 standards impose a lower limit on the path MTU (summarized in the table below). The IP standards also define a minimum datagram size that all hosts must be prepared to accept, which for IPv4 is 576 bytes. This means that all IPv4 end-hosts on the Internet are required to be able to reassemble datagrams up to 576 bytes in length, even though practical values are much higher in reality (at least 1500 bytes).

Media                     MTU (bytes)   Min. datagram size at end-hosts (bytes)
Internet IPv4 Path MTU    68            576
Internet IPv6 Path MTU    1280          1280
Ethernet v2               1500          -

IP Fragmentation and Path MTU

In IPv4, routers in the middle of a path are able to fragment IP datagrams if the DF (Don’t Fragment) flag in the IP header is not set. Packets are split into smaller fragments that need to be reassembled at the receiving host. However, IP fragmentation should be avoided whenever possible because of its drawbacks:

  • IP fragments are often dropped by firewalls, load balancers and stateful inspection devices; this is to prevent DDoS attacks, since middle boxes often need to process the entire datagram before they can forward it, requiring state keeping; a recent DNS cache poisoning attack surfaced that exploits IP fragmentation.
  • IP fragmentation can cause excessive retransmissions when fragments encounter packet loss as TCP must retransmit all of the fragments in order to recover from the loss of a single fragment.
  • For cases where LAG is used, equal-cost multi-path (ECMP) typically uses fields from the 5-tuple to balance packets across different interfaces. With fragmentation, only the first fragment carries the 5-tuple information. This means subsequent fragments can be routed to different interfaces, arrive before the first fragment, and be dropped by some security devices.

To avoid fragmentation, end-hosts typically run PMTUD for each destination, or process ICMP "Packet too big" messages (the default on Linux) and adjust the maximum TCP segment size (MSS) accordingly. The TCP MSS is the maximum TCP segment that can be sent at a time between client and server, and it's advertised as a TCP option by each end of the connection. The effective maximum segment size is typically picked as the minimum between the local MSS and the received MSS. By default, most hosts advertise their TCP MSS as the local MTU minus headers (40 bytes for IPv4 and 60 bytes for IPv6), so a common value for TCP MSS is 1500-40=1460 bytes. However, this does not prevent IP fragmentation, since there can be paths with a PMTU lower than 1500 bytes, e.g. GRE tunnels. Another (perhaps more efficient) technique is MSS clamping, where middle boxes actually change the value of the MSS in active TCP connections. In any case, there are three main problem cases that can happen:

  • Oversized TCP MSS: Whenever the TCP MSS plus headers is greater than the PMTU, ICMP "Packet Too Big" messages will be received and the maximum datagram size is adjusted. Multiple TCP packets can be dropped before the ICMP packet is received, causing TCP retransmissions and extra delays.
  • Undersized PMTU: For UDP, the client performs IP fragmentation every time it needs to send a datagram larger than the PMTU. A typical case where this can happen is DNS. If the PMTU < 576 bytes, initial DNS packets will be lost until the sender receives the ICMP "Packet Too Big". It's up to the application to detect this and retransmit. After receiving the ICMP packet, Linux updates its per-destination PMTU cache, which can be seen with the command "ip route show cache". The retransmitted datagram will suffer IP fragmentation to accommodate the smaller PMTU.
  • PMTUD failure: In IPv6, fragmentation is only done by the end-hosts, which means that end-hosts must be able to run PMTUD to a destination and adjust the size of IP packets accordingly. If PMTUD is failing under IPv6, which typically occurs when ICMP messages are filtered (PMTU black holes), you are about to enter a world of pain, since your IPv6 traffic will be dropped on the floor without further notice. Therefore, failed PMTUD is a big problem for IPv6 connections. In IPv4 a similar problem happens if the Don't Fragment flag is set (the default on Linux).
     
To summarize the problem cases by protocol:

IPv4
  • Undersized PMTU (PMTU < 576 bytes): IP fragmentation for UDP packets; critical for DNS servers.
  • Oversized TCP MSS (MSS > PMTU – 40 bytes): will trigger a Can't Fragment ICMP from the first TCP packet.
  • Undersized TCP MSS (MSS < 1380 bytes, except DNS): will generate extra packets, increasing per-packet processing.
  • PMTUD failure: large packets that have the DF bit set can potentially always be dropped (PMTU black holes).

IPv6
  • Undersized PMTU (PMTU < 1280 bytes): violation of IETF standards; packets will be dropped.
  • Oversized TCP MSS (MSS > PMTU – 60 bytes): will trigger a Packet Too Big ICMP from the first TCP packet.
  • Undersized TCP MSS (MSS < 1220 bytes, except DNS): will generate extra packets, increasing per-packet processing.
  • PMTUD failure: large packets can potentially always be dropped (PMTU black holes).
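
As an aside, on Linux the per-socket DF/PMTUD behavior mentioned above is controlled with the IP_MTU_DISCOVER socket option. The sketch below is a minimal, Linux-specific illustration of forcing or relaxing strict PMTUD on a sender; it is my own example (not part of any ThousandEyes tooling) and error handling is trimmed.

#include <cstdio>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    // A UDP socket, used here only to illustrate the per-socket PMTUD modes.
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    // IP_PMTUDISC_DO: always set the DF bit; datagrams larger than the path
    // MTU fail locally with EMSGSIZE instead of being fragmented.
    int mode = IP_PMTUDISC_DO;
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode)) != 0) {
        perror("setsockopt(IP_MTU_DISCOVER)");
    }

    // IP_PMTUDISC_DONT would clear DF and allow in-network fragmentation,
    // a common workaround when a path is suspected of black-holing PMTUD.
    close(fd);
    return 0;
}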

PMTU and MSS in Path Visualization

Whenever you go to the Path Visualization view in our platform, you can check the PMTU and MSS values that each agent is using when connecting to a target, as shown in Figure 1.

PMTU and MSS in Path Visualization

Figure 1

In Figure 1, you can see the San Francisco agent is working with an MSS of 1380 bytes, and the result of PMTUD is 1500 bytes. This is the normal scenario where PMTU > MSS+headers. However, we detect several cases where either PMTU, MSS, or the combination of the two might create problems. Let's look at some of these cases.

Case #1: Oversized TCP MSS

Figure 2 shows a case where the TCP MSS plus headers is actually higher than the Path MTU. In the figure, all 3 IPv6 agents negotiated a TCP MSS of 1440 bytes, meaning the minimum between the MSS sent by the server and the MSS of the agent is 1440 bytes. This means the client can send packets as large as 1500 bytes to the server. But given that the Path MTU is actually 1400 bytes, the client will receive a "Packet too big" ICMP message and reset its maximum segment size to 1340 bytes. As you can imagine, this process is not very efficient and contributes to performance degradation over time.

Oversized TCP MSS in Path Visualization

Figure 2

Case #2: Failed Path MTU Discovery

Figure 3 shows the case of a ThousandEyes agent in Boston that is not able to perform PMTUD over an IPv4 route. This case is problematic because it means ICMP "Can't Fragment" messages are being dropped, and for paths with a low PMTU, packets are simply going to be dropped.

Failed Path MTU Discovery in Path Visualization

Figure 3

Case #3: Pinpointing Links with Low MTU

Figure 4 shows how Path Visualization can be used to pinpoint the link that had a drop in the MTU (1400 bytes) and generated an ICMP Packet Too Big message. This information is very useful to detect tunnel entry points, e.g. IP-in-IP tunnels, GRE tunnels, IPSEC tunnels, etc.

Pinpointing Links with Low MTU in Path Visualization

Figure 4

Getting Started with PMTU and MSS Visualization

PMTU and MSS information is essential to troubleshoot enterprise networks, particularly to understand performance issues surrounding VPNs, IPv6 and tunneling. As more networks roll out IPv6 and jumbo-capable Ethernet, PMTU will increasingly become a metric that network operators will want to review. And PMTU path visibility can also help with more mundane problems, such as when links have the wrong MTU set. Get started with a free trial of ThousandEyes today to gain visibility into PMTU and MSS with Path Visualization.
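
Finally, if you want to cross-check from a Linux host what the kernel currently believes the Path MTU and effective TCP MSS are toward a particular destination (values analogous to those the agents report in Path Visualization), the minimal sketch below reads both per connected socket. The target address and port are placeholders, the code is Linux-specific, and error handling is trimmed.

#include <arpa/inet.h>
#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    const char* target_ip = "192.0.2.10";   // placeholder: replace with your target
    const unsigned short target_port = 80;  // placeholder port

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(target_port);
    inet_pton(AF_INET, target_ip, &addr.sin_addr);

    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }

    int pmtu = 0, mss = 0;
    socklen_t len = sizeof(pmtu);
    getsockopt(fd, IPPROTO_IP, IP_MTU, &pmtu, &len);      // kernel's cached path MTU
    len = sizeof(mss);
    getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len);  // effective max segment size

    std::printf("Path MTU: %d bytes, TCP MSS: %d bytes\n", pmtu, mss);
    close(fd);
    return 0;
}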

The post Troubleshooting Path MTU and TCP MSS Problems appeared first on ThousandEyes Blog.


ThousandEyes Privacy Management System


A solid Privacy Management System shows our employees, customers, vendors and other third parties our commitment to respecting individuals' privacy rights and expectations and to protecting collected personal data from unauthorized access, use, retention/storage and/or disclosure. Our Privacy Management System applies to all our technologies and business processes and is based on the following ten principles:

  • Management responsibility;
  • Notices to Individuals relating to the processing of their personal data;
  • Informing Individuals of the available choices (opt-in/opt-out);
  • Disclosing personal data to third parties only for valid business reasons and, where required, with the individual's implicit or explicit consent;
  • Security of personal data;
  • Collecting personal data only for a specific purpose, which must be disclosed;
  • Restricted use of collected data – only for the period of time required to fulfill the purpose;
  • Providing Individuals with access to their personal data for review and updates;
  • Quality of the personal data;
  • Enforcement of the principles is a responsibility of every ThousandEyes employee and the management team.

Today ThousandEyes has achieved an important milestone in its Privacy Management; we have completed third party evaluation of our Software-as-a-Service application and marketing practices.

Validate TRUSTe privacy certification

ThousandEyes has been awarded TRUSTe’s Privacy Seal signifying that this privacy policy and practices have been reviewed by TRUSTe for compliance with TRUSTe’s Program Requirements and the TRUSTed Cloud Program Requirements including transparency, accountability and choice regarding the collection and use of your personal information. TRUSTe’s mission, as an independent third party, is to accelerate online trust among consumers and organizations globally through its leading privacy trustmark and innovative trust solutions.
ThousandEyes Safe Harbor
ThousandEyes has also certified compliance with the U.S.-EU and U.S.-Swiss Safe Harbor frameworks as set forth by the U.S. Department of Commerce regarding the collection, use, and retention of personal data from European Union member countries and Switzerland. ThousandEyes has certified that it adheres to the Safe Harbor Privacy Principles of notice, choice, onward transfer, security, data integrity, access, and enforcement. To learn more about the Safe Harbor program, and to view ThousandEyes' certification, please visit www.export.gov/safeharbor.

If you have questions regarding our privacy policy or practices, please contact us at privacy@thousandeyes.com.

The post ThousandEyes Privacy Management System appeared first on ThousandEyes Blog.

Visualizing the Performance of InteropNet


The ThousandEyes team spent several days last week at Interop monitoring the health of InteropNet, the volunteer-built network that powers the conference. We set out to instrument InteropNet in order to monitor the health of services used by conference attendees and vendors. Over the course of the week while the network was up and humming, we gathered performance data and dug into network problems. Let’s take a look inside InteropNet to see how it performed.

Instrumenting InteropNet

Gathering performance data for InteropNet was a relatively simple exercise. Our in-house InteropNet guru and ThousandEyes solution engineer Ken Guo installed six agents to generate tests and record performance data. These six agents, virtual appliances running on Dell servers, were distributed across InteropNet. Four were located on VLANs that served the exhibit hall and conference rooms (PEDs). Two were located in InteropNet data centers, one in Sunnyvale and the other in Las Vegas. With these six agents deployed, we were able to generate a view of the InteropNet topology.

6 private agents installed

Figure 1: InteropNet topology with agents in green and network interfaces in blue. Larger blue circles represent devices with multiple interfaces. Links between devices are shown in gray, with thicker lines representing more paths from the agents to external services.

Interface grouping to identify devices

Figure 2: Using the same InteropNet topology as Figure 1, here we can see the locations of agents as well as interfaces in blue grouped by the device they are associated with (Avaya and Cisco switches and routers) leading to primary ISP CenturyLink (Qwest).

Testing InteropNet Performance

We set up a number of tests that actively probed services on InteropNet, including:

  • From InteropNet to key services: mobile app, registration server, social media sites, Salesforce.com and Webex
  • From InteropNet to the LV and SFO data centers as well as EWR and DEN edges
  • From POPs around the US to the Interop registration site, website and CenturyLink circuits
  • Local DNS resolver
  • Authoritative DNS server for Interop.com and Interop.net
  • Interop BGP prefixes

Troubleshooting Critical Services

While at Interop we were on the lookout for service interruptions. One that we noticed occurred with Salesforce, popular with sales and BD folks at the show. We noticed two periods where Salesforce was unavailable from InteropNet, each lasting up to 10 minutes.

Salesforce unavailable due to ISP issues

Figure 3: Salesforce.com is unavailable from InteropNet locations in Las Vegas and San Francisco.

The lack of Salesforce availability corresponded with high levels of packet loss and latency on the path between InteropNet and Salesforce. Packet loss averaged 57% and latency jumped to 2 seconds.

salesforce high packet loss and latency

Figure 4: Salesforce availability is impacted by high packet loss and latency.

When we drilled into the path visualization between InteropNet and Salesforce.com we immediately saw the culprit. InteropNet’s primary ISP, CenturyLink, peers with Comcast Business Network en route to Salesforce data centers in California. The two spikes in packet loss coincide with traffic dropping on the San Jose edge between CenturyLink (Qwest) and Comcast.

salesforce packet loss in centurylink

Figure 5: From InteropNet on the left, packets are being lost en route to Salesforce.com as they transit from CenturyLink to Comcast Business.

Rewinding 30 minutes, we can see how these nodes were performing when availability was unaffected. At this point, CenturyLink and Comcast Business are peering in San Jose without issue.

Route of salesforce to InteropNet

Figure 6: Under normal conditions traffic transits the same CenturyLink and Comcast Business nodes en route to Salesforce.com NA1, on the right.

From this information, we can conclude that the two service interruptions of Salesforce.com on InteropNet were caused by changes occurring at the peering point between CenturyLink and Comcast in San Jose. In this particular case, the network hiccups occurred when most attendees were not likely using the show network. But having visibility allowed the InteropNet team to monitor for problems as they arose throughout the week.

Interop Show Network: Viewing InteropNet’s Autonomous System

We also monitored Border Gateway Protocol (BGP) routing to InteropNet over the course of the conference in order to gain visibility into any routing issues that might occur. BGP defines the preferred routes that traffic will take from networks around the Internet to InteropNet, as identified by the Interop Show Network Autonomous Systems (AS 290 and 53692). These two networks have routes via CenturyLink (Qwest) (AS 209), the primary ISP, to the rest of the Internet.

InteropNet BGP AS

Figure 7: The Interop Show Network (AS 290 and 53692) is connected via BGP routes to CenturyLink (Qwest) (AS 209), the primary ISP, which then peers with dozens of networks to make InteropNet reachable to locations around the globe, in green.

It’s a Wrap

By now InteropNet has been torn down, only to be rebuilt again next year. In the end, InteropNet performed beautifully. Performance to key applications was speedy. Service interruptions were minimal. We had a blast helping the InteropNet team build a network from scratch!

The post Visualizing the Performance of InteropNet appeared first on ThousandEyes Blog.

Efficiency Comparison of C++ JSON Libraries


In this blog post we’re going to compare the parsing and serialization efficiency of three different C++ JSON libraries. We are carrying out this performance test because we use JSON-RPC in some of our applications. The volumes of data we handle are quite large and we need to be as efficient as possible when processing them.

Libraries to Test

JsonCpp: JsonCpp is among the most popular libraries for JSON parsing, manipulation and serialization. It currently averages 1,000 downloads per week from its SourceForge page. One of the strongest points of this library is its intuitive, easy-to-use API and the complete documentation hosted on the library's site.

For the benchmarks we are going to use the trunk version at revision 276 of the svn repository.
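
To give a feel for JsonCpp's API style, here is a minimal parse/modify/serialize sketch. It is my own illustration rather than code from the benchmarks, and the header path may differ depending on how the library is installed.

#include <json/json.h>  // header location may vary by installation
#include <iostream>
#include <string>

int main() {
    const std::string text = "{\"name\": \"ThousandEyes\", \"agents\": [1, 2, 3]}";

    // Parse a JSON document from a string.
    Json::Value root;
    Json::Reader reader;
    if (!reader.parse(text, root)) {
        std::cerr << "parse error: " << reader.getFormattedErrorMessages();
        return 1;
    }

    // Access and modify values through Json::Value.
    root["agents"].append(4);
    std::cout << root["name"].asString() << " has "
              << root["agents"].size() << " agents\n";

    // Serialize back to a compact string.
    Json::FastWriter writer;
    std::cout << writer.write(root);
    return 0;
}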

Casablanca: This is more than a JSON library; it is an SDK for client-server communication developed and maintained by Microsoft. One of the cons of the Casablanca library is its immaturity: the first version was published on June 27, 2012. The library is actively being developed, which is always a good sign. We are going to use version 2.0 for our benchmarks, the first stable release, published on March 20, 2014.

As Boost is a dependency of this library, we will use the latest stable version of Boost, 1.55.

JSON Spirit: This is a JSON manipulation library based on the Boost Spirit parser generator. JSON Spirit is a mature library whose first version was published on August 10, 2007, and it has been regularly updated since. It is hard to tell whether the library is currently being maintained, since there is no publicly available code repository. The documentation is scarce but consistent and clear.

It provides two implementations of the JSON objects: one using maps and one using vectors. Each has its pros and cons, but at least both implementations are available. For our benchmark purposes we will test both implementations of the latest library release to date, version 4.06. The Boost version used to compile it will be 1.55.

Testing Environment

All the benchmarks were executed in a host with the following properties:

  • Memory: 8 GB, 1333 MHz
  • OS: Ubuntu 12.04 64bit
  • Processor: AMD Phenom II X4 925

All the tests and libraries were compiled using g++ version 4.8 with the following flags:

  • -std=c++11
  • -O3

Benchmark Tests

For each of the libraries we will analyze the parsing and serializing speed for three objects of varying size. In all cases the keys of the dictionaries are 16-byte strings.

Small Object Test: The small object is a JSON object containing 100 elements of different types; it is nested up to 3 levels deep, the nested objects have a maximum size of 10, and the lists contain up to 25 objects. The content of the file was generated randomly and you can download the data on Github. Its size is 800 KB.

Medium Object Test: In this test, the object has 5,000 key-value pairs of mixed types following the same nesting, inner object size and list size restrictions. The file can be downloaded from Github; it is 8.9 MB.

Large Object Test: Here we push the boundaries to test how the libraries behave in an extreme case. The object to be processed has 100,000 elements. In this case, the nesting restriction was set to two levels and the maximum size for the lists and nested objects was set to 5. The generated file is 25 MB and can be downloaded from Github.

Benchmarking Methodology

In order to measure time as accurately as possible, we are going to use one of the features available in the C++11 standard: the chrono library, and in particular the steady_clock. According to C++ Reference,

Class std::chrono::steady_clock represents a monotonic clock. The time points of this clock cannot decrease as physical time moves forward. This clock is not related to wall clock time, and is best suitable for measuring intervals.

For each of the benchmarks, we take the time immediately before executing the action under test 1,000 times. Once the loop has finished, the current time is taken again. We obtain the average time of the tested action by dividing the interval between the two measurements by the number of times the action was executed. The unit used for all measurements is milliseconds.

To reduce interference from the I/O subsystem, in the parsing benchmarks the file is loaded into a string and parsed from there. For the serialization tests, the generated strings are not saved to files for the same reason.
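
The measurement loop boils down to something like the sketch below. This is a simplified stand-in for the actual benchmark runners; the payload string and the lambda body are placeholders for the test files and the library calls under test.

#include <chrono>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Runs `action` the given number of times and returns the average duration
// per iteration in milliseconds, measured with the monotonic steady_clock.
double average_ms(const std::function<void()>& action, int iterations = 1000) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        action();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}

int main() {
    // Stand-in payload; the real benchmarks load the 800 KB / 8.9 MB / 25 MB
    // test files into this string before the loop starts.
    const std::string payload = "{\"key\": [1, 2, 3]}";

    const double ms = average_ms([&payload] {
        // Placeholder work: replace with the parse or serialize call of the
        // library under test, e.g. JsonCpp, Casablanca or JSON Spirit.
        volatile std::size_t n = payload.size();
        (void)n;
    });

    std::cout << "average: " << ms << " ms per iteration\n";
    return 0;
}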

The source code of the benchmark runners can be downloaded from Github here.

Parsing Benchmarks Results

The following are the results of the parsing tests. The lower the processing time, the better the library performs.

JSON library small object parsing

The winner in this case is clearly JsonCpp. The difference between it and the second-fastest library, JSON Spirit's vector implementation, was 50%. Casablanca was the worst performer in this test, taking almost 2.5 times longer than JsonCpp and 35% longer than the nearest competitor.

The medium dictionary parsing benchmark shows similar results to the previous one. JsonCpp still has an advantage over the rest, although this time the difference between JsonCpp and JSON Spirit's vector implementation is tighter, only 16%. The difference between Casablanca and JSON Spirit is the same as when parsing the small object.

JSON library medium object parsing

The large object benchmark shows that when dealing with a large number of keys, JSON Spirit's vector implementation is faster than JsonCpp. This is to be expected, as inserting at the back of a vector is faster than inserting into a map, the underlying structure used by JsonCpp.

JSON library large object parsing

Serialization Benchmarks

From the serialization benchmarks we can tell that JsonCpp and Casablanca are roughly tied. For small objects, both libraries serialize with the same efficiency. When dealing with the medium-size dictionaries, JsonCpp has a slight advantage of around 8%. But in the large-object serialization test, Casablanca is the winner with an advantage of 7.6%.

JSON Spirit's map implementation performed the worst at serialization, particularly for small and large objects. It is 20%-30% slower than JsonCpp and Casablanca.

JSON library small object serializing

JSON library medium object serializing

JSON library large object serializing

Based on the results of the benchmarks, we can state that JsonCpp is the best option for general JSON usage. It performed better than the other two libraries for both parsing and serializing. The downside of using JsonCpp is that you need to use the repository version rather than the release version, because the latter is out of date and several improvements have been made in the development branch. If you have any questions or came to a different conclusion, let me know in the comments section below.

The post Efficiency Comparison of C++ JSON Libraries appeared first on ThousandEyes Blog.

4 Real BGP Troubleshooting Scenarios


Border Gateway Protocol (BGP) is a key component of Internet routing and is responsible for exchanging information on how Autonomous Systems (ASes) can reach one another. When BGP issues occur, inter-network traffic can be affected, from packet loss and latency to complete loss of connectivity. This makes BGP an important protocol for network operators to be able to troubleshoot.

Using BGP route visualizations from real events, we’ll illustrate four scenarios where BGP may be a factor to consider while troubleshooting:

  • Peering changes
  • Route flapping
  • Route hijacking
  • DDoS mitigation

Peering Changes

One common scenario where BGP comes into play is when a network operator changes peering with ISPs. Peering can change for a variety of reasons, including commercial peering relationships, equipment failures or maintenance. During and after a peering change, it is important to confirm reachability to your service from networks around the world. ThousandEyes presents reachability and route change views, as well as proactive alerts to help troubleshoot issues that may occur.

An example of a peering change in action is with Github and their upstream ISPs. In Figure 1, we see Github's (AS 36459) peering relationships change as routes to Level 3 (AS 3356 and AS 3549) are withdrawn. This is important data when tracking down network performance, which can be adversely affected by major or frequent peering changes.

BGP Peering Change

Figure 1: Github, in light green, has four upstream ISPs (in dotted blue): NTT America (AS 2914), Comcast (AS 7922) and Level 3 (AS 3356 and AS 3549). In this view we see Github changing peers, withdrawing routes to Level 3 (dotted red) and keeping only NTT and Comcast as peers.

Route Flapping

Route flapping occurs when routes alternate or are advertised and then withdrawn in rapid sequence, often resulting from equipment or configuration errors. Flapping often causes packet loss and results in performance degradation for traffic traversing the affected networks. Route flaps are visible in ThousandEyes as repeating spikes in route changes on the timeline.

While monitoring Ancestry.com, a popular genealogy website, we noticed a route flap with their upstream providers. In this case, shown in Figure 2, the route flap with XO Communications lasted for about 15 minutes and disrupted connectivity from networks such as GTT and NTT while others such as Level 3 and Cogent that peered with American Fiber had no issues. For an in-depth look at this route flapping event, read our post on Monitoring BGP Routes with ThousandEyes.

BGP Route Flapping

Figure 2: The network for Ancestry.com (AS 36175) flaps routes for a short period of time, disconnecting from one of its primary ISPs, XO Communications (AS 2828), while remaining connected to another, American Fiber (AS 31993).

Route Hijacking

Route hijacking occurs when a network advertises a prefix that it does not control, either by mistake or in order to maliciously deny service or inspect traffic. Since BGP advertisements are generally trusted among ISPs, errors or improper filtering by an ISP can be propagated quickly throughout routing tables around the Internet. As an AS operator, route hijacking is evident when the origin AS of your prefixes changes or when a more specific prefix is broadcast by another party. In some cases, the effects may be localized to only a few networks, but in serious cases hijacks can affect reachability from the entire Internet. You can set alerts in ThousandEyes to notify you of route changes and new subprefixes.

In early April 2014, Indosat, a large Indonesian telecom, incorrectly advertised a majority of the global routing table, in effect claiming that their network was the destination for a large portion of the Internet. The CDN Akamai was particularly hard hit, with a substantial portion of its customers’ traffic rerouted to the Indosat network for nearly an hour. We can see this hijack play out with a test for one of Akamai’s customers, Paypal. Figure 3 shows the hijack in progress, with two origin ASes, the correct one for Akamai (AS 16625) and the incorrect one for Indosat (AS 4761), which for approximately 30 minutes was the destination for 90% of our public BGP vantage points. While this hijack was not intentional, the effects are nonetheless serious.

BGP Route Hijacking

Figure 3: Paypal’s routes are hijacked when Indosat (AS 4761) incorrectly advertises routes that divert traffic from the proper network belonging to Akamai (AS 16625), affecting a majority of networks.

DDoS Mitigation

For companies using cloud-based DDoS mitigation providers, such as Prolexic and Verisign, BGP is a common way to shift traffic to these providers during an attack. Monitoring BGP routes during a DDoS attack is important to confirm that traffic is being routed properly to the mitigation provider’s scrubbing centers. In the case of DDoS mitigation, you’d expect to see the origin AS for your prefixes change from your own AS name to that of your mitigation provider.

We can see this origin AS change in a real example from a global bank that was subject to a DDoS. In Figure 4 we can see the routes to the bank’s AS are withdrawn and new routes to the cloud-based DDoS mitigation vendor are advertised. The process then happens in reverse at the end of the attack when mitigation is turned off. Read more about Using BGP to Reroute Traffic During a DDoS attack.

Reroute BGP DDoS Attack

Figure 4: A global bank uses BGP to reroute traffic from their own AS (white circle) to that of their DDoS mitigation provider (green circle), with route changes in red.

Learning More

Monitoring and troubleshooting BGP is a crucial part of managing most large networks. Visibility into BGP route changes and reachability is a powerful tool for operators to correlate events and diagnose root causes. For more information about tracking and correlating BGP changes with ThousandEyes, check out our on-demand webinar, Visualizing and Troubleshooting BGP.

The post 4 Real BGP Troubleshooting Scenarios appeared first on ThousandEyes Blog.

UltraDNS DDoS Affects Major Web Services


UltraDNS DDoS Outage
Yesterday we woke up to alerts going off across a wide range of web services. In some cases, ThousandEyes employees weren’t able to access tools we use internally, such as RingCentral and Salesforce. We knew something big was up and dug into our tests to find out what was going on. Here’s what we saw and how we tracked the unfolding situation.

Alarms Go Off

Starting at 8:15am Pacific on Wednesday, April 30th, service availability alerts started going off across a range of services that we track, including ServiceMax, RingCentral, Veeva Systems and Salesforce. While these services were still generally available to users, particularly those with active sessions, new logins were in some cases affected.

A look at the HTTP Server view in Figure 1, showing service availability for ServiceMax, a field service management SaaS, reveals the issues beginning. Our agents, which pull 'fresh', non-cached DNS records for performance tests, show DNS resolution failing for over 60% of locations. This view, combined with similar ones for other affected services, clued us in to a widespread UltraDNS issue.

UltraDNS DDoS outage affects ServiceMax

Figure 1: ServiceMax, hosted on Salesforce’s CloudForce platform, saw availability issues resulting from the UltraDNS DDoS for over 12 hours, from 8:15am to 9pm Pacific. Here shown at 9am Pacific.

UltraDNS Outage

It quickly became apparent that the service interruptions were related to an outage of UltraDNS, a DNS service offered by Neustar that powers a number of important web services and applications, including ServiceMax. We tracked this by diving into the DNS Server view, which shows how many authoritative name servers are available and resolving the hostname.

Figure 2 shows the authoritative name servers for ServiceMax, the same ones used for Salesforce, since ServiceMax is hosted on the Salesforce platform. For several hours, a majority of the DNS servers were unable to resolve hostnames, and those that could saw up to a 10X increase in resolution time.

UltraDNS outage ServiceMax Salesforce

Figure 2: The DNS servers for ServiceMax rely on UltraDNS. During the worst parts of the outage, around 10am Pacific, we found more than 90% of requests were failing, including 100% of those with UltraDNS domains.

We see a similar issue with RingCentral, which also uses UltraDNS, in Figure 3.

UltraDNS DDoS Outage RingCentral

Figure 3: DNS for RingCentral’s VOIP gateways are impacted for nearly 3 hours.

Looking further, we can see from a network metrics view that there is high packet loss occurring en route to UltraDNS from all of our agent locations. Figure 4 shows more than 50% packet loss to UltraDNS servers and UltraDNS hosted servers, such as the one for ServiceMax and Salesforce.

UltraDNS DDoS Outage

Figure 4: High levels of packet loss occurred from around the world when reaching UltraDNS servers. The DDoS attack was most intense between 8am and 7pm Pacific, though some servers, like the one above, were only affected for a subset of the time.

DDoS Fingerprints

Looking further into the situation, we can see that the outage was actually caused by a DDoS attack on UltraDNS. We were tipped off by the sudden, severe and widespread packet loss seen in the previous view. To validate this, we can use a path visualization of packets from our agents to UltraDNS servers.

Figure 5 reveals the DDoS attack, with traffic flowing through scrubbing centers (highlighted with dotted lines) that filter out attack traffic. One scrubbing center appeared to be performing well (blue circle), enabling DNS resolution from the Western US and international locations. Another (red circle) was causing significant packet loss and DNS resolution problems for Eastern US locations. UltraDNS confirmed that this was indeed a DDoS.

UltraDNS DDoS Outage Path Visualization

Figure 5: A path visualization to UltraDNS servers (light green on right) shows traffic transiting scrubbing centers, one operating normally (blue dotted circle) and one causing significant packet loss for Eastern and Central US locations (red dotted circle).

Troubleshooting DNS and DDoS

All in all, the UltraDNS outage impacted customers for up to 13 hours, from 8am to 9pm Pacific. With DDoS attacks becoming ever more powerful and creating large-scale disruptions, it is important to monitor your key services such as DNS. Tools such as DNS Server tests and path visualization help you keep an eye on unfolding service outages to plan a proper response. If you’re interested in learning more about how DDoS attacks affect service availability, check out previous posts on Visualizing Cloud-Based DDoS Mitigation and Using ThousandEyes to Analyze a DDoS Attack on Github.

The post UltraDNS DDoS Affects Major Web Services appeared first on ThousandEyes Blog.
