A hyperscale data center needed both a central monitoring and management platform and a way to standardize maintenance procedures. The challenge was that existing systems were already in place, and ripping and replacing them was not feasible. Beyond this, the system needed to expand over time, not just to more data halls at that location but also, as a unified solution, to new construction at other geographic locations. Radix IoT and Mango fulfilled these complex needs with flying colors.
“I encourage anybody to take a look at Radix IoT to solve complex multi-site maintainability challenges. It allowed us to quickly deploy a solution to unify what was a mess of different tools into a single and easy to operate toolset that can be accessed locally, or from our network operations center on the other side of the world.”
The idea behind the monitoring and alerting system was to have a common, standardized way to manage and display vital metrics and to notify staff when an issue arose in any of the dozen subsystems in operation. The key was to alert local staff to a problem quickly, while also using the same dataset remotely so that predictive analytics models could support preventative maintenance practices across the portfolio. The solution needed to be flexible and integrate easily with all kinds of existing equipment, systems, and devices, while offering a consistent look and feel so that anyone in the maintenance organization could operate it easily. Management like this is extremely difficult at hyperscale. The system needed to handle complex data models, yet be nimble enough to accommodate changes in SLAs, which historically demanded complex reprogramming to produce the differing proof of service that would stand up to an audit.
Mango OS was deployed at each location and then centralized back to a cloud instance for overarching monitoring and control. Although many views were provided to give different stakeholders the insight they required, the primary architecture afforded the customer:
a general health overview for management,
a site drill-down metrics and alarms view,
data hall visualization views for audit purposes,
a deep drill-down view to any particular device for troubleshooting and diagnostics
General Health View
Shows the operational behavior of a single site or of the portfolio as a whole (all the sites or a subset of them). Metrics include power profiles and efficiency, overall health, active alarms sorted by severity, and a summary of the system's readiness to provide compliance proof to customers.
Site Views
The site views provide both an overview and refined real-time and custom historical data for each site. This covers equipment alarms as well as derived alarms, which let the customer define logical conditions that span several pieces of equipment and that warrant attention when met. Additionally, this view offers a 3-D model of the location, its overall health, and power information, along with custom-definable KPIs. Single-line electrical diagrams are also included and show live, interactive equipment status. The customer can create additional site dashboards on the fly to show whatever information is pertinent as unique business concerns arise.
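To make the idea of a derived alarm concrete, here is a minimal sketch in Python. It is not Mango's API or configuration format. The `DerivedAlarm` class, the point names, and the temperature threshold are all hypothetical and serve only to show a logical condition that combines several pieces of equipment.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping

# Point name -> latest value. All point names here are hypothetical.
Readings = Mapping[str, float]


@dataclass(frozen=True)
class DerivedAlarm:
    """An alarm defined by a logical condition over several points."""
    name: str
    severity: str
    condition: Callable[[Readings], bool]

    def is_active(self, readings: Readings) -> bool:
        return self.condition(readings)


# Hypothetical rule: both cooling units are off AND the supply air is too warm.
# The 27.0 degC threshold is an illustrative assumption, not a recommendation.
cooling_at_risk = DerivedAlarm(
    name="hall-1 cooling at risk",
    severity="critical",
    condition=lambda r: (
        r["crah_1.running"] == 0
        and r["crah_2.running"] == 0
        and r["hall_1.supply_temp_c"] > 27.0
    ),
)


def active_alarms(alarms: List[DerivedAlarm], readings: Readings) -> List[str]:
    """Return the names of all derived alarms whose condition currently holds."""
    return [a.name for a in alarms if a.is_active(readings)]
```

No single device would raise this alarm on its own; it only fires when the readings from three separate points satisfy the combined condition.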
Data Hall Views
Shows the status of a single data hall – its white space, cooling system, UPSs, generators, active alarms, and KPIs specific to this room. The electrical diagram can show such specifics as the power source for each rack, and the display can be changed to show power demand, consumption, load utilization and much more. The same level of detail is possible for cooling, and UPSs – displaying alarms, temperature sensor locations, temperature ranges and alerts if they are being exceeded, status, and device issues.
Device Views
The device views drill down to individual devices in any data hall, such as a specific HVAC unit or UPS, and show specific information on each piece of equipment, such as connection status, electrical information, or alarms, with data obtained directly from the device. Additionally, this view provides tunnels to these devices for secure remote access to an individual device's diagnostics or setup parameters.
These views could not be achieved with a standard BMS and are fully customizable by the customer, per user, per hall, and so on. A refined user management system, integrated with the customer's Azure SSO system, lets the customer give employees access to exactly the information they need for their specific roles, no more and no less. With each data hall generating between 20,000 and 30,000 points, Mango is currently handling 134,000 points and is expected to handle 300,000 in the next year.
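As a rough sanity check on those figures, the short Python sketch below converts point counts into an equivalent number of data halls. It uses only the numbers quoted above. The hall counts it produces are derived estimates, not figures reported by the customer, and the function name is hypothetical.

```python
from typing import Tuple

POINTS_PER_HALL_MIN = 20_000   # lower bound quoted per data hall
POINTS_PER_HALL_MAX = 30_000   # upper bound quoted per data hall
CURRENT_POINTS = 134_000       # points handled today
EXPECTED_POINTS = 300_000      # points expected within the next year


def equivalent_halls(total_points: int) -> Tuple[float, float]:
    """Return the (fewest, most) halls that could account for total_points."""
    fewest = total_points / POINTS_PER_HALL_MAX  # every hall at the upper bound
    most = total_points / POINTS_PER_HALL_MIN    # every hall at the lower bound
    return fewest, most


current_range = equivalent_halls(CURRENT_POINTS)
expected_range = equivalent_halls(EXPECTED_POINTS)
growth_factor = EXPECTED_POINTS / CURRENT_POINTS

print(f"current:  {current_range[0]:.1f} to {current_range[1]:.1f} halls")
print(f"expected: {expected_range[0]:.1f} to {expected_range[1]:.1f} halls")
print(f"growth:   {growth_factor:.2f}x")
```

On these assumptions, the current load corresponds to roughly 4.5 to 6.7 data halls and the expected load to 10 to 15, a little more than a doubling of the point count.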
Work order and task scheduling are also being addressed by the Radix IoT solution. While the customer had an application that provided checklist management, its limited functionality didn’t support their current business needs. They needed a customizable solution that supported their unique requirements, specifically data center management, and was adaptable enough to meet their growing and evolving needs. The solution also had to connect with a variety of other existing software systems, including ticketing, work orders, maintenance scheduling tools, and performance tracing.
The operations portion of the solution supports facility planning and the creation of reproducible procedures to manage staff and workload, all while keeping technicians safe and productive and internal customers informed and reassured.
The solution also includes templates for procedures based on the data collected. An alarm in the management system can trigger a work order in the operations system, and the task list can be customized for the individual job. When followed carefully, these checklists help eliminate errors, can be reused rather than reinvented each time they are needed, and scale easily for future use. Technicians have a step-by-step checklist for all maintenance procedures, including if/then scenarios and what to do if something goes wrong. This keeps the process standard and the technician safe, since the data center’s dangerous electrical equipment must be handled carefully and responsibly.
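The alarm-to-work-order flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's data model: the class names, the `ups_on_battery` alarm type, and the checklist text are all hypothetical, and the steps are placeholders, not a real safety procedure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    instruction: str
    if_failed: Optional[str] = None  # the "if/then" branch when a step goes wrong


@dataclass(frozen=True)
class ProcedureTemplate:
    title: str
    steps: Tuple[Step, ...]


@dataclass
class WorkOrder:
    alarm_id: str
    device: str
    title: str
    steps: List[Step]


# Hypothetical reusable template, keyed by a hypothetical alarm type.
TEMPLATES: Dict[str, ProcedureTemplate] = {
    "ups_on_battery": ProcedureTemplate(
        title="UPS on battery response",
        steps=(
            Step("Confirm the alarm on the device view",
                 if_failed="Escalate to the shift supervisor"),
            Step("Record the remaining battery runtime"),
        ),
    ),
}


def work_order_from_alarm(alarm_id: str, alarm_type: str, device: str,
                          extra_steps: Sequence[Step] = ()) -> WorkOrder:
    """Build a work order from the template for this alarm type.

    extra_steps customizes the task list for the individual job without
    changing the shared template.
    """
    template = TEMPLATES[alarm_type]
    return WorkOrder(
        alarm_id=alarm_id,
        device=device,
        title=f"{template.title}: {device}",
        steps=[*template.steps, *extra_steps],
    )
```

Because the template is immutable and copied into each work order, a job-specific step added for one device never leaks into the next work order built from the same template.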
The Operations platform can be loaded onto mobile devices. It gives technicians easy access to the procedure checklists and the work details, from required tools to the expected duration, and it lets the organization track compliance with these standards. The procedures also take into consideration who should be assigned to which tasks, and whether approval or permission from a supervisor or the customer must be obtained before the job is done. Work can be tested, reviewed, verified, and approved or sent back for revision. Finally, when the work is completed and closed, the whole process is documented for compliance reporting.
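One common way to model a lifecycle like this is a small state machine with an audit trail. The Python sketch below is a simplified illustration, not the Operations platform's actual workflow. The state names, the `TrackedWorkOrder` class, and the approval flag are assumptions chosen to mirror the review, revision, approval, and closure steps described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical states and the transitions allowed between them.
ALLOWED: Dict[str, Set[str]] = {
    "assigned": {"in_progress"},
    "in_progress": {"in_review"},
    "in_review": {"approved", "needs_revision"},
    "needs_revision": {"in_progress"},
    "approved": {"closed"},
    "closed": set(),
}


@dataclass
class TrackedWorkOrder:
    order_id: str
    requires_approval: bool = False   # supervisor or customer sign-off needed
    approved_to_start: bool = False   # set once that sign-off is given
    state: str = "assigned"
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def transition(self, new_state: str, actor: str) -> None:
        """Move to new_state, enforcing the workflow and recording who did it."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        if (new_state == "in_progress" and self.requires_approval
                and not self.approved_to_start):
            raise PermissionError("approval is required before work starts")
        # Each entry is (from_state, to_state, actor), kept for compliance reports.
        self.history.append((self.state, new_state, actor))
        self.state = new_state
```

The `history` list is the point of the design: every move, including a trip back through revision, leaves a record that can be reported once the work order is closed.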
The new solutions gave the customer additional capabilities not previously possible:
A single pane of glass view tailored to job roles.
Health metrics that show performance characteristics at a glance by portfolio, site, hall, or other subsets.
Ability to remotely triage down to single components.
The flexibility to adapt the system to new locations or reporting requirements without costly and time-consuming reprogramming.
Overall, this has allowed the customer to respond to issues more quickly and, more importantly, to increase uptime, offer a higher availability performance metric, decrease unscheduled downtime, and provide tangible proof of performance for all their customers.
The Radix IoT solution has already simplified what was previously a highly complex set of solutions. It continues to grow and can evolve with little effort to meet the client’s needs and requests. Although it already integrates with a substantial number of systems, the customer is expanding this integration after seeing its ease and flexibility.