A system can be like the tower of blocks built when playing a game of Jenga. Prior to the start of the game, the tower is an orderly stack of blocks. Over the course of the game, blocks are removed from the tower and placed on top. As blocks are removed from the bottom and added to the top, the structural integrity of the tower is compromised, leading to an inevitable collapse.
This is analogous to the situation with GIS when new solutions and technology are added to an existing system. New components can be added at the expense of stability. One component, like a missing block in a Jenga tower, can topple the entire system.
Maintenance isn’t exciting, but having a maintenance plan for a system can be the difference between a well-oiled machine and a catastrophic failure. At its core, a maintenance plan is a set of procedures that are used to keep a system running optimally. These procedures include components like asset inventories, audits, monitoring, upgrades, and tools. Every time one of these components is utilized, a block is put back into your Jenga tower to restore its stability.
A maintenance plan ensures that new components added to the system don’t destabilize existing components. Mitigating a disaster is always cheaper than recovering from one. A system maintenance plan is perhaps the most effective tool a GIS administrator can use to prevent a disaster from happening.
Stark County, Ohio, is home to more than the Pro Football Hall of Fame and the final resting place of William McKinley, the 25th president of the United States. The county, with a population of nearly 375,000, has a robust GIS that includes ArcGIS Enterprise (portal, ArcGIS Server, multiuser geodatabases), ArcGIS Online, ArcGIS Desktop, automated processes that use ArcGIS GeoEvent Server and Python, and other supplemental servers and software.
The Stark County GIS Department (SCGIS) provides services and support for dozens of departments across Stark County and its cities, villages, townships, and—most important—its residents. In the past five years, the department’s footprint has grown significantly, with services increasing by roughly 10 percent and web solutions increasing by more than 300 percent.
In response, the team added new staff members in each of those five years. Even so, the team recognized that offerings were expanding at an unsustainable pace, so it created a sub-team to focus on system maintenance and put together a plan for keeping SCGIS running smoothly. GIS systems analyst Joe Guzi and data engineer Brandon Wise wanted to start with manageable goals. Instead of trying to create an all-encompassing plan from the start, they wanted to test a proposed framework for auditing the system on a single issue.
Although SCGIS had been using dozens of Python scripts to automate processes for years, it didn’t have a catalog of what each script did and when it ran, which was a large source of unpredictability. Arguably the most important part of a maintenance plan for any system is an asset inventory. It is impossible to make informed decisions about the health of a system without one.
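As an illustration of what a first-pass script inventory might look like, the sketch below walks a folder of Python scripts and records each script's path, the first line of its docstring, and its length. It is not SCGIS's actual tooling; the function name `inventory_scripts` and the record fields are assumptions made for this example. It also does not capture when a script runs, which would have to come from whatever scheduler launches it.

```python
import ast
from pathlib import Path


def inventory_scripts(root):
    """Return one record per .py file under root (hypothetical record layout)."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        try:
            doc = ast.get_docstring(ast.parse(source))
        except SyntaxError:
            doc = None  # the script does not parse, so no docstring can be read
        # Use the first docstring line as a summary; flag scripts that have none.
        summary = doc.strip().splitlines()[0] if doc and doc.strip() else "UNDOCUMENTED"
        records.append({
            "script": path.relative_to(root).as_posix(),
            "summary": summary,
            "lines": len(source.splitlines()),
        })
    return records
```

Scripts reported as UNDOCUMENTED are natural first candidates for the documentation and standardization work an audit would tackle next.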
In what became their first Python audit, Wise and Guzi created an inventory of these scripts. They documented each script and used that documentation to create a code base to standardize each script during the subsequent Python audit. Standardization, particularly in GIS, is a crucial element for mitigating unpredictability. Wise and Guzi also added error-reporting notifications to alert the team when scripts run unsuccessfully. The framework the team used for the Python audit worked and was ready to be extended to other components of the system. The timing couldn’t have been better, because another part of the system was going to prove what happens in the absence of audits.
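The error-reporting idea can be sketched as a small wrapper that runs a script's main routine and reports any failure. This is a minimal sketch, not the team's implementation: `run_with_alert` and the `notify` callback are hypothetical names, and how the alert is delivered (email, chat message, ticket) is left to whatever function is passed in.

```python
import traceback


def run_with_alert(name, task, notify):
    """Run task(); on any exception, pass a subject and traceback to notify.

    `notify` is a caller-supplied function taking (subject, body). It is a
    placeholder for whatever alert channel an organization actually uses.
    Returns True on success and False on failure.
    """
    try:
        task()
    except Exception:
        notify(f"Script '{name}' failed", traceback.format_exc())
        return False
    return True
```

Wrapping every scheduled script in the same way is one form of the standardization described above: each failure produces an alert in a consistent format, whichever script raised it.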
SCGIS has two production servers: a public-facing one and another for secure services. Each ArcGIS Server installation has a limited number of instances based on machine hardware. Despite multiple servers, one of those machines was at capacity, so the team initially thought it would need to install another server.
Luckily, the team decided to audit the servers prior to making any purchases. After conducting an inventory and evaluation, Wise and Guzi were able to greatly reduce the total instances on each server, which eliminated the need for another installation. Four years later, SCGIS can host twice as many services on those same two installations because of biannual audits of ArcGIS Server.
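The article does not describe how the instance counts were evaluated, so the following is only one plausible way to frame such an audit: compare each service's configured maximum instances with its observed peak use and total up the headroom. The field names `max_instances` and `peak_in_use`, the `capacity` figure, and the function name are all assumptions for illustration.

```python
def audit_instances(services, capacity):
    """Summarize configured instances and list services with unused headroom.

    `services` is a list of dicts with hypothetical keys "name",
    "max_instances", and "peak_in_use"; `capacity` is the number of
    instances the machine is assumed to support.
    """
    configured = sum(s["max_instances"] for s in services)
    # A service is a trim candidate when its observed peak is below its maximum.
    candidates = {
        s["name"]: s["max_instances"] - s["peak_in_use"]
        for s in services
        if s["peak_in_use"] < s["max_instances"]
    }
    return {
        "configured": configured,
        "over_capacity": configured > capacity,
        "reclaimable": sum(candidates.values()),
        "candidates": candidates,
    }
```

If the reclaimable total is large enough to bring the configured total back under capacity, trimming instances is worth trying before buying another server.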
Audits like these are a major part of maintaining the system. In addition to audits of Python scripts and ArcGIS Server (Enterprise), the team conducts four additional audits on an annual or biannual basis, depending on the importance of the component. Those audits cover Microsoft SQL Server, GeoEvent Server, custom geoprocessing tools and services, and GIS users. While these audits are extremely useful, they won’t identify problems as they happen. This is where real-time monitoring comes into play.
The team began using ArcGIS Monitor to view real-time feedback from the department’s servers. Monitor allows GIS users to view the health and performance of their system. Just as Python error reporting alerts team members when scripts aren’t working, Monitor can send alerts based on custom parameters to indicate that part of the system is running poorly. It also allows users to see performance over time to identify problems.
During the height of the COVID-19 pandemic, SCGIS used Monitor to diagnose problems with GeoEvent Server, which was being used to automatically update contact tracing information for the health department. Monitor allowed the team to pinpoint and fix problems.
The team also conducts a monthly review of the GIS servers, which Guzi refers to as the high-level system check. Each month, Wise and Guzi review warnings and errors from the server logs to look for potential issues. They worked with Esri to better understand these items so they can determine which ones are cause for concern.
Think of these monthly reviews as routine blood work: they may not identify exactly what’s wrong, but they can be an indicator of a larger problem. Identifying patterns helps the team better understand the system overall, which has reduced the likelihood of unpredictable events occurring.
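The pattern-spotting part of a monthly review can be approximated by comparing this month's warning and error counts with last month's. The sketch below assumes the log entries have already been exported and reduced to (level, code) pairs; that input format, the function name `recurring_issues`, and the doubling threshold are assumptions for illustration and are not taken from the SCGIS procedure.

```python
from collections import Counter


def recurring_issues(this_month, last_month, factor=2.0):
    """Return (key, count) pairs that are new this month or grew by `factor`.

    Each input is an iterable of hashable keys, for example (level, code)
    tuples taken from exported server logs.
    """
    now, before = Counter(this_month), Counter(last_month)
    flagged = [
        (key, count)
        for key, count in now.items()
        if before[key] == 0 or count >= factor * before[key]
    ]
    # Most frequent first; ties are broken by key so the order is stable.
    return sorted(flagged, key=lambda kc: (-kc[1], kc[0]))
```

Like blood work, the output says only that something has changed. Deciding whether a flagged item is cause for concern still takes the kind of review described above.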
In addition to using ArcGIS Monitor, Wise and Guzi developed a custom tool to generate reports showing the interconnectedness of system components. They call this tool the ArcGIS Enterprise Lineage. Python scripts combine data about the system at regular intervals, and the results are viewed using SQL Server Reporting Services (SSRS). These reports allow team members to see which system components affect others. For instance, if Wise wants to see every service that uses a feature class from a particular dataset, all he has to do is select the dataset in the report and it generates a list of those services. Instead of relying strictly on inventories, Lineage allows team members to generate dynamic reports that provide an intimate view of the system.
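The dataset-to-services lookup described here is essentially an inverted index. The sketch below shows the idea in plain Python. It is not the Lineage tool itself: the input layout (each service mapped to the dataset and feature class pairs it reads) and the function name `build_lineage` are assumptions for illustration.

```python
from collections import defaultdict


def build_lineage(service_sources):
    """Invert {service: [(dataset, feature_class), ...]} into
    {dataset: sorted list of services that use any of its feature classes}.
    """
    by_dataset = defaultdict(set)
    for service, sources in service_sources.items():
        for dataset, _feature_class in sources:
            by_dataset[dataset].add(service)
    return {dataset: sorted(services) for dataset, services in by_dataset.items()}
```

Once the index is built, answering the question "which services depend on this dataset?" is a single dictionary lookup, which is what makes this kind of report useful before a dataset is changed or retired.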
A maintenance plan doesn’t just include procedures for maintaining the system. It also includes procedures for changing or updating it. Adding new components or technology with reckless abandon can topple the system. For routine updates, like those for Windows Server, consistency is key.
Windows Server, for example, requires regular updates to patch security vulnerabilities. Each month, Windows Server is updated at the same time (during off-hours) using a set procedure that prevents disruptions to users and other system components.
With larger upgrades, such as moving to a new version of ArcGIS Enterprise, the process is more involved. Wise and Guzi developed an upgrade procedure as part of the system maintenance plan. The team starts with a risk-reward analysis to determine if a new offering’s benefits are worth the risks to the existing system. To allow adequate research time for each release, SCGIS limits ArcGIS Server upgrades to one per year.
The team schedules these upgrades during low-usage times, such as late December, and avoids times when there are important events such as elections. Upgrades are always deployed on staging servers first to evaluate mission-critical workflows. These mission-critical items were identified by key users across the county to create a list that is utilized during testing.
If any mission-critical workflow fails on the staging server, the upgrade is reverted and is not performed on production servers. If the staging server passes the testing phase, the production servers are upgraded following the same testing procedure. Upgrades occur only during off-hours, so that even if a mission-critical item fails, there is no disruption to the user.
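The pass-or-revert rule for staging can be expressed as a simple gate: run every mission-critical check, and promote the upgrade only if none of them fails. The sketch below assumes each check is a callable that returns a truthy value on success; the function name `evaluate_upgrade` and the check names in the usage note are hypothetical.

```python
def evaluate_upgrade(checks):
    """Run every named check and decide whether the upgrade may proceed.

    `checks` maps a workflow name to a zero-argument callable that returns
    a truthy value when the workflow still works. Returns ("promote", [])
    when all pass, otherwise ("revert", [names of failed checks]).
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return ("revert", failed) if failed else ("promote", [])
```

A call such as `evaluate_upgrade({"parcel search": check_parcel_search})`, with one entry per item on the mission-critical list, returns the decision together with the names of any failed workflows. All checks are run even after the first failure, so the team sees every problem at once.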
The common thread in each maintenance plan component is documentation. SCGIS uses Confluence, a wiki-style, web-based tool from Atlassian, to document everything about the system, from design to maintenance procedures. Documentation is not only vital to replicating procedures; it also helps mitigate contingencies such as the departure of a team member. If Wise and Guzi were to both leave the department, their documentation could be used by other team members to replicate system maintenance procedures.
No matter the size, all GIS implementations require maintenance. As a system’s size grows, so does the chance of uncertainty. Developing a comprehensive maintenance plan is a necessary measure to ensure system integrity. GIS administrators without a plan don’t need to panic. Wise and Guzi recommend starting with small, achievable goals.
Inventorying system components is a great place to dive in because understanding the structure of a system is important for making good decisions. Once assets are inventoried, brainstorm ways to improve them, such as developing standards. The key here is an iterative approach that allows for gradual improvement over time at a schedule that works for the organization. Asking questions like, What am I missing? or What if this happens? will help change the way administrators view their systems. The plan also doesn’t need to be perfect from the start. The plan and its procedures will evolve as new technology, methods, and other contingencies are introduced.
Ultimately, having a plan in any form is better than not having one at all. “By failing to prepare, you are preparing to fail,” is an apt aphorism that is often attributed to Benjamin Franklin. Preparing a maintenance plan is one of the best ways to keep a system from failing.
For more information, contact the Stark County GIS Department by email at gis@starkcountyohio.gov.