- Posted by Brendan Caulfield, Turing Group COO
Encompassing the greater Chicagoland area, Metra is one of the largest and most complex regional commuter rail systems in North America.
Serving millions of commuters each month and managing such an extensive fleet of trains depends on an intricate data landscape.
With 241 stations, 11 routes, and approximately 1,200 miles of track across six counties in northeastern Illinois, Metra operates more than 700 weekday trains, providing about 300,000 passenger trips each weekday.
The project required Turing Group to build a responsive infrastructure and a modernized e-commerce ticket sales platform, and to leverage Metra’s extensive General Transit Feed Specification (GTFS) data system. Metra’s GTFS data is used to manage and monitor the entire train fleet and to give Metra riders more relevant and timely schedule information and system alerts.
With the advent of new technologies and standards, Metra wanted to enhance the rider experience by developing a new, comprehensive website that made getting pertinent information quick and easy. Inclement weather, track and station construction, heavy usage during peak hours and special events all contributed to growing frustration among Metra riders, who could not get accurate, real-time information on trains and their locations.
“Our goal was to create a customer-friendly website that presents information in a logical and intuitive way. We hope these changes will make using our website—and using our train service—an even more satisfying experience.”
– Don Orseno, Metra Executive Director/CEO
“Metra debuts new and improved metrarail.com”
metrarail.com, June 29, 2016
A significant problem with the Metra website, and its infrastructure, was the way in which GTFS data were delivered to riders via a browser-based pull mechanism, rather than modern server-based push technology such as WebSockets. Alerts were not presented within the context of their respective train or route and were, instead, provided in a manner that caused riders to scroll through an often lengthy list of notices to find any that may have impacted their particular train or route.
The new Metra system needed to handle massive spikes in website traffic during special events and severe weather. Scaling could be scheduled in advance for known events, but unforeseen ones, such as inclement weather or mechanical failures, left the site vulnerable to unpredictable spikes. Because Amazon Web Services (AWS) is elastic and Metra’s ridership-driven traffic is highly variable, the project was a natural fit for functionality already available in AWS.
As an AWS Managed Services Partner (MSP), Turing Group was brought in for our expertise in AWS’ vast functionality, implementation, and management. Turing Group needed to architect an environment that would expand and contract automatically with demand and load. Designing this type of environment can be challenging, as it requires developers to understand the elastic principles of the cloud and to create software that can handle the variable nature of auto-scaling scenarios, where server instances appear and disappear as the cluster ramps up or down.
Because a website crash would be disastrous, the system needed to be highly available and fault tolerant. It was also vital that development and deployment be straightforward, consistent and reliable, with no impact on site availability.
When complete, Turing Group needed to be able to hand off the environment to Metra, ensuring that their software and systems engineers, and those of their partners, had the ability to update code, content and live GTFS feed data promptly to improve rider communication.
Finally, because the website would process credit cards for ticket sales, Turing Group and Clarity Partners collaborated to achieve Payment Card Industry (PCI) compliance, which verified, via audit, that Metra and its partners take all available precautions to provide security standards and multiple layers of defense to protect sensitive consumer data.
Websites and the infrastructure that runs them do not scale automatically to meet demand and are not fault tolerant by default, so Turing Group had to take care when designing the environment. To accomplish these two critical goals, the environment was designed from the start to use multiple Availability Zones within the AWS Cloud, which means hosting the site across multiple physical data centers. This solution required both CloudFront (AWS’ content delivery network offering) and Elastic Load Balancers (ELBs). A well-architected Auto Scaling policy was also critical to allow the environment to grow and shrink based on utilization. Auto Scaling helps maintain application availability by adjusting Amazon EC2 capacity up or down automatically according to defined conditions. It can keep a desired number of EC2 instances running, add instances during demand spikes to maintain performance, and remove them during lulls to reduce costs.
In addition to overall management of the Metra project, Clarity Partners was responsible for the site front end design and architecture of the Drupal components, ticket sales e-commerce integration, and supporting the PCI compliance initiative.
As a DevOps-focused firm, Turing Group guided the Clarity development team on how to deploy, run, and code Drupal for an auto scaling environment, helped design and deploy the Continuous Integration (CI) jobs and processes, and led triage and management of all issues.
The hosting infrastructure now consists of development, staging and production environments. Through CI tools and processes, Metra and Clarity can test new code and ideas in the development and staging environments, then push them to production with confidence. The new DevOps-oriented development process also allows multiple development teams to contribute to the code base, including Metra’s internal development team.
To address the lack of real-time push capability of GTFS data, Turing Group built, and continues to maintain, the real-time data feed for alerts, train positions, and schedule data. Our system makes this data available via WebSockets and a standard JSON-based REST API.
Additionally, Turing Group provides direct access to the raw GTFS data feed. This access protects Metra’s GTFS source from being overloaded by too many requests. In addition to providing data to riders via the website, Metra provides GTFS data to major providers such as Google, Microsoft, Yahoo, and others for their mapping and routing applications. The new real-time messaging platform, based on Node.js and PubNub, is generating and delivering millions of messages a day to website users and other connected devices. To be useful, these messages need to be delivered in a reliable, scalable and timely fashion.
Figure 6: Real-Time GTFS Messages – 3 Months Trailing
(Spike on 11/4 was for the Chicago Cubs World Series parade and rally)
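The difference between the old pull model and the push model described above can be illustrated with a small publish/subscribe sketch. This is only an in-memory illustration under stated assumptions, not the production platform: the class name `RouteAlertHub`, the message fields, and the route codes used in the example are all hypothetical. It shows how publishing alerts per route delivers them in context, so a rider receives only the notices for the route they follow.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Message = Dict[str, str]
Handler = Callable[[Message], None]


class RouteAlertHub:
    """In-memory publish/subscribe hub keyed by route.

    Hypothetical stand-in for a real-time channel service; names and
    message shapes are assumptions made for illustration only.
    """

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, route: str, handler: Handler) -> None:
        """Register a callback that receives every alert published for a route."""
        self._subscribers[route].append(handler)

    def publish(self, route: str, message: Message) -> int:
        """Push an alert to all subscribers of a route; return how many were notified."""
        handlers = self._subscribers.get(route, [])
        for handler in handlers:
            handler(message)
        return len(handlers)
```

A client would call `subscribe("BNSF", callback)` once and then receive each alert as it is published, instead of repeatedly fetching and scanning a full list of notices.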
Metra now enjoys a fully managed platform in AWS that covers all aspects of the environment, including automated code deployments throughout the development cycle.
Amazon CloudFront is a global content delivery network (CDN) service that accelerates delivery of websites, APIs, video content or other web assets. With Amazon CloudFront, you don’t need to worry about maintaining expensive web-server capacity to meet the demand for content from potential traffic spikes. The service automatically responds as demand increases or decreases, without any intervention. CloudFront allows Turing Group to serve cached data to metrarail.com end users, rather than passing their requests directly to the web/application servers to fulfill every client request. This process allows us to significantly increase the efficiency of the Metra site and ultimately scale back on the instances/resources needed to service riders.
The positive impact of improved site performance was significant and immediate: during the month of August 2016, over 5.5TB of data was pushed from CloudFront and only 61GB of data had to be served from the web servers. Running far fewer cloud server instances and other infrastructure to service end users will also enable Metra to manage their server costs.
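To put those August 2016 figures in perspective, the share of traffic absorbed by the cache can be worked out directly. The sketch below is a rough calculation, assuming 1 TB is counted as 1,000 GB and that the 61 GB served by the web servers is the origin traffic behind the 5.5 TB delivered by CloudFront; the function name is our own.

```python
def cache_offload_ratio(edge_gb: float, origin_gb: float) -> float:
    """Fraction of delivered bytes that did not have to come from the origin servers."""
    if edge_gb <= 0:
        raise ValueError("edge_gb must be positive")
    return 1.0 - origin_gb / edge_gb


# Figures quoted in the text; the TB-to-GB conversion factor is an assumption.
edge_gb = 5.5 * 1000
origin_gb = 61

print(f"{cache_offload_ratio(edge_gb, origin_gb):.1%}")  # prints 98.9%
```

Under these assumptions, roughly 99 percent of the bytes delivered to riders never touched the web servers.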
Amazon CloudFront passes on the benefits of Amazon’s scale. You pay only for the content that you deliver through the network, without minimum commitments or up-front fees. This applies to any type of delivered content: static, dynamic, streaming media, or a web application with any combination of these.
Figure 7: Benefits of CloudFront, Highlighted
(The spike to 500GB on 11/4/16 was in support of the Chicago Cubs World Series parade & rally)
The auto scaling solution Turing Group and Clarity developed has allowed the site to scale and respond to demand in three fundamental ways.
First, we programmed scheduled scale-up and scale-down scenarios that increase the size of the cluster before rush hour, when the site is in highest demand, and decrease it afterward. If full capacity is not required 24×7, it is more economical to run cloud instances at peak capacity for only about 8 hours a day.
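A scheduled scaling rule of this kind boils down to a lookup from time of day to desired cluster size. The following is a minimal sketch, not the actual schedule: the rush-hour windows and server counts are hypothetical values chosen so that peak capacity runs for 8 hours a day.

```python
from datetime import time

# Hypothetical rush-hour windows and cluster sizes (assumptions for illustration).
RUSH_WINDOWS = [
    (time(5, 30), time(9, 30)),    # morning rush, 4 hours
    (time(15, 30), time(19, 30)),  # evening rush, 4 hours
]
PEAK_SERVERS = 8
OFF_PEAK_SERVERS = 2


def scheduled_capacity(now: time) -> int:
    """Return the desired number of web servers for a given time of day."""
    in_rush = any(start <= now < end for start, end in RUSH_WINDOWS)
    return PEAK_SERVERS if in_rush else OFF_PEAK_SERVERS
```

In practice the windows would start somewhat ahead of the real rush so that new instances are already in service when traffic arrives.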
Second, the site can also scale itself automatically if the load on servers or page response times get too high. This ability to scale dynamically means that Metra runs servers only when they’re needed, automatically adjusting to peak and off-peak times, which saves Metra a considerable amount of money in hosting costs. Metra also doesn’t need to try to plan for unforeseen spikes, since the site will scale up and handle them as the need arises.
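The dynamic behavior described above can be sketched as a simple threshold rule on server load and page response time. This is an illustrative simplification rather than the policy running in production: every threshold and limit below is an assumption, and a real policy would also apply cooldown periods between adjustments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All values are illustrative assumptions, not Metra's actual settings.
    cpu_high: float = 70.0         # scale out above this average CPU %
    cpu_low: float = 25.0          # scale in below this average CPU %
    latency_high_ms: float = 800.0  # scale out above this page response time
    min_servers: int = 2
    max_servers: int = 20


def desired_servers(
    current: int,
    cpu_percent: float,
    latency_ms: float,
    policy: ScalingPolicy = ScalingPolicy(),
) -> int:
    """Return the next cluster size given current load and response time."""
    if cpu_percent > policy.cpu_high or latency_ms > policy.latency_high_ms:
        target = current + 1  # under pressure: add a server
    elif cpu_percent < policy.cpu_low:
        target = current - 1  # mostly idle and responsive: remove a server
    else:
        target = current      # within the comfortable band: hold steady
    # Never go below the floor or above the ceiling.
    return max(policy.min_servers, min(policy.max_servers, target))
```

The floor keeps the site available across quiet periods, while the ceiling caps cost if a metric misbehaves.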
Finally, Turing Group is able to proactively scale up the Metra AWS environment in anticipation of traffic spikes related to Chicago events. The first tests of this capability were the Lollapalooza music festival and the annual Taste of Chicago festival in July 2016.
More recently, and most significantly, Turing Group managed the Metra cloud environment through the surges in ridership that accompanied the Chicago Cubs’ World Series games at Wrigley Field, and particularly the Cubs parade and rally in Grant Park on Friday, November 4. Metra Executive Director/CEO Don Orseno accurately predicted that the day of the parade and rally would be the day of highest ridership in Metra history. Though the streets, train stations and sidewalks of Chicago struggled to accommodate the physical demands of nearly 5 million Cubs fans attending the celebration, Turing Group ensured the Metra AWS environment kept the site live and data flowing to provide all riders with accurate schedules and timely alerts.
Figure 8: Auto Scaling Graph for Two Weeks (# of active web servers in the cluster over time)
(Spikes on 11/4 and 11/5 were for the Chicago Cubs World Series parade and rally)
Turing Group’s proprietary mix of tools and processes was used from the start, along with our DevOps philosophy and approach to environment architecture. From infrastructure build and deployment to application deployment and Continuous Integration (CI), Turing Group always operates in a way that is consistent with current best practices in the DevOps space. This approach allows for much greater efficiency, consistency and repeatability within the environments we manage, and the Metra project was no exception. We used tools like Ansible, CloudFormation, Git, Jenkins and others to make sure all changes to the environment were vetted, self-documenting, well-orchestrated, and easily rolled back in case of issues. As a result, Metra benefits from quicker, cleaner and more seamless deployment of their applications and cloud infrastructure.
The new Metra environment has more than 28 CI jobs that support three different teams of developers. Coordinating deployments and pushes across that many teams would typically be complicated and error-prone. Because of how we implemented CI, each development group can take control of its own deployments in a consistent fashion.
What does all this mean for Metra and its riders? More rapid deployments mean faster fixes, ensuring accurate data reaches riders. Also, there will be more frequent feature releases and, overall, a more stable environment.
The DevOps approach also contributes to PCI compliance because the automation process documents releases and deployments, and helps account for changes to the environment. With traditional, old-school deployments, systems and software engineers may have had to log in to servers and manually deploy or install new versions of custom software. By using a DevOps approach and Continuous Integration tools, we eliminated the need for much of this, as Metra’s developers and engineers can simply rely on external processes to deploy new versions of software. Building solutions that keep these developers and engineers from logging in to servers directly reduces the PCI burden and takes many of the access concerns out of scope.
The new system is also saving Metra money. As of the launch, the agency’s new web services provider is expected to save Metra 50 percent, or about $400,000 a year, compared with its previous contract. Development costs are also lower, and the open-source platform means that Metra can perform both support and development in-house, saving money and ensuring timely updates to the site and its content.
"Metra is proud of our new site. We've included enhancements that directly improve our customers' ability to make the best travel decisions for themselves. For instance, the schedule finder tool has been upgraded to provide more information: customers can decide whether to view the schedule between two stops or the whole schedule for the line, and the results will show if the train is running behind schedule or if there are any other service changes affecting that train, such as a decision to add or skip stops. Innovations like these are possible because of the technical choices we've made, and those same choices will allow us to continue to innovate for our customers."
– Cherie Kizer, CIO, Metra