Do you perform online searches to look for publicly available datasets that contain the information you need for an analysis? Of course. Now, how often do you find a clean dataset with the exact information that you need, nothing more, nothing less, and in a form that is ready to use for your scenario? Probably very rarely. In this blog post, we will look at a process for data search, selection and cleanup.
Let’s say you are an analyst at a marketing firm. Your client is a university that wants to boost its enrollment from public schools in Marion County, Indiana. You are responsible for allocating your firm's resources to outreach and promotion efforts on behalf of the university. To begin your analysis, you need the locations of all public high schools in Marion County, IN. A search on ArcGIS Hub for “public schools united states” returns several results, including a recently updated dataset of all public schools in the United States, shared by the Oak Ridge National Laboratory.
Click on the title Public Schools to open it.
Click on View Metadata and look for information on terms of use. Under Constraints, you are able to confirm that the dataset is in the public domain, and it would be permissible to use it in your analysis.
Next, on the Data tab, use the filter buttons in the column headers to filter the dataset to only include schools in Marion County, IN that teach students Grades 9 through 12. This will be the preliminary list of high schools your promotion needs to cover. Download the filtered dataset as a shapefile.
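If you would rather script the same filter after downloading the full table, the logic can be sketched in plain Python. This is only an illustration under stated assumptions: the field names (`NAME`, `STATE`, `COUNTY`, `ST_GRADE`, `END_GRADE`), the grade codes, and the sample rows are hypothetical stand-ins, and "teaches Grades 9 through 12" is read here as "the school's grade range overlaps 9 to 12", which is one possible interpretation.

```python
# Hypothetical sample rows; field names and grade codes are assumptions,
# not confirmed against the actual dataset schema.
schools = [
    {"NAME": "Ben Davis High School", "STATE": "IN", "COUNTY": "MARION",
     "ST_GRADE": "10", "END_GRADE": "12"},
    {"NAME": "Ben Davis Ninth Grade Center", "STATE": "IN", "COUNTY": "MARION",
     "ST_GRADE": "09", "END_GRADE": "09"},
    {"NAME": "Example Elementary", "STATE": "IN", "COUNTY": "MARION",
     "ST_GRADE": "KG", "END_GRADE": "06"},
    {"NAME": "Example High School", "STATE": "OH", "COUNTY": "MARION",
     "ST_GRADE": "09", "END_GRADE": "12"},
]


def grade_number(code):
    """Convert a grade code such as '09' to an int; non-numeric codes become 0."""
    return int(code) if code.isdigit() else 0


def serves_high_school(row):
    """True when the school's grade range overlaps Grades 9 through 12."""
    start = grade_number(row["ST_GRADE"])
    end = grade_number(row["END_GRADE"])
    return start <= 12 and end >= 9


def in_marion_county_in(row):
    """True for records in Marion County, Indiana (county match is case-insensitive)."""
    return row["STATE"] == "IN" and row["COUNTY"].upper() == "MARION"


marion_high_schools = [
    row for row in schools
    if in_marion_county_in(row) and serves_high_school(row)
]

for row in marion_high_schools:
    print(row["NAME"])
```

Filtering on state as well as county matters because more than one state has a Marion County.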
Next, you will examine the data. In ArcGIS Pro, add the shapefile to a map, and open its attribute table. Sort the schools in descending order of address. Notice that some of the 58 high schools are located extremely close to others. Use the select tool to select one such cluster on the map. The cluster contains three schools, because three records are selected in the attribute table.
One of those schools is named “Area 31 Career & Tech Center” and has an enrollment of 0. With no enrolled students, it is clearly not a high school; the facility is apparently being used as a career and tech center (presumably after school hours). Delete this location from the list.
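Spotting clusters by eye works for a single county. For a larger table, a rough programmatic check might look like the sketch below, which flags pairs of sites that lie within a chosen distance of each other using the haversine great-circle formula. The coordinates, site names, and the 200 m threshold are made up for illustration and are not taken from the dataset.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000.0
THRESHOLD_M = 200  # hypothetical cutoff for "extremely close"


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Made-up (name, latitude, longitude) tuples standing in for school records.
sites = [
    ("School A", 39.7700, -86.2900),
    ("School B", 39.7702, -86.2903),
    ("School C", 39.9000, -86.0500),
]

# Compare every pair once and keep the pairs closer than the threshold.
close_pairs = [
    (a[0], b[0])
    for a, b in combinations(sites, 2)
    if haversine_m(a[1], a[2], b[1], b[2]) <= THRESHOLD_M
]

print(close_pairs)
```

Comparing every pair is quadratic in the number of sites, which is fine for a few dozen schools but would call for a spatial index on a national table.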
The other two are “Ben Davis High School” and “Ben Davis Ninth Grade Center”. From the Start Grade and End Grade columns, it is evident that the Ninth Grade Center serves only Ninth Grade students and the High School serves Grades 10 through 12. From the map, it appears that they are part of the same facility or school building.
Since you need school locations to plan direct outreach and marketing, it makes sense to treat this as a single high school location to cover, rather than two. Merge the Ninth Grade Center with the high school using these steps:
- Add the Ninth Grade Center enrollment to the High School enrollment figure.
- Add an attribute to the table for “Notes” and add a note recording the merge, so that the Ninth Grade Center is not overlooked.
- Delete the Ninth Grade Center record.
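The three merge steps above can also be expressed as a small function. The sketch below is a hypothetical illustration: `merge_into` is not an ArcGIS tool, the `ENROLLMENT` and `NOTES` field names are assumptions, and the enrollment figures are placeholders rather than the real values.

```python
def merge_into(records, keep_name, absorb_name):
    """Fold one record into another: sum enrollment, add a note, drop the absorbed record.

    The kept record is updated in place; a new list without the absorbed
    record is returned.
    """
    keep = next(r for r in records if r["NAME"] == keep_name)
    absorbed = next(r for r in records if r["NAME"] == absorb_name)

    # Step 1: add the absorbed enrollment to the kept record.
    keep["ENROLLMENT"] += absorbed["ENROLLMENT"]
    # Step 2: record the merge so the absorbed school is not overlooked.
    keep["NOTES"] = f"Includes {absorb_name} (enrollment {absorbed['ENROLLMENT']})"
    # Step 3: drop the absorbed record.
    return [r for r in records if r is not absorbed]


# Placeholder enrollment numbers, for illustration only.
cluster = [
    {"NAME": "Ben Davis High School", "ENROLLMENT": 3000},
    {"NAME": "Ben Davis Ninth Grade Center", "ENROLLMENT": 1000},
]

merged = merge_into(cluster, "Ben Davis High School", "Ben Davis Ninth Grade Center")
print(merged)
```

Writing the original enrollment into the note keeps the merge reversible if you later need the per-school figures.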
Continue cleaning up the dataset by following these steps:
- Review the other clusters and consolidate separate schools that are located at the same site.
- Sort the table in ascending order of enrollment and delete any other sites with an enrollment of 0.
- Lastly, delete or hide columns that will not be needed in your analysis – for example LATITUDE, LONGITUDE, Country, and VAL_DATE.
After these data cleanup steps are complete, the dataset has 42 high school sites.
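The last two cleanup steps (removing zero-enrollment sites and dropping unneeded columns) could be scripted along the following lines. As before, this is a sketch under assumptions: the `ENROLLMENT` field name and the sample rows are hypothetical, and column names are compared case-insensitively because the exact capitalization in the table may differ.

```python
# Columns named in the cleanup step, upper-cased for case-insensitive matching.
UNNEEDED_COLUMNS = {"LATITUDE", "LONGITUDE", "COUNTRY", "VAL_DATE"}


def clean(records):
    """Drop zero-enrollment sites, sort by enrollment, and remove unneeded columns."""
    kept = [r for r in records if r["ENROLLMENT"] > 0]
    kept.sort(key=lambda r: r["ENROLLMENT"])  # ascending, as in the manual step
    return [
        {k: v for k, v in r.items() if k.upper() not in UNNEEDED_COLUMNS}
        for r in kept
    ]


# Hypothetical rows; values are placeholders.
rows = [
    {"NAME": "Site A", "ENROLLMENT": 1200, "LATITUDE": 39.8, "LONGITUDE": -86.1,
     "Country": "USA", "VAL_DATE": "placeholder"},
    {"NAME": "Site B", "ENROLLMENT": 0, "LATITUDE": 39.7, "LONGITUDE": -86.2,
     "Country": "USA", "VAL_DATE": "placeholder"},
    {"NAME": "Site C", "ENROLLMENT": 450, "LATITUDE": 39.9, "LONGITUDE": -86.0,
     "Country": "USA", "VAL_DATE": "placeholder"},
]

cleaned = clean(rows)
print(cleaned)
```

The function returns new dictionaries and leaves the input rows untouched, so the original download remains available for comparison.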
It’s rare to find data that’s already perfectly formatted for your needs, but the work you do to prepare and clean a dataset also gives you a better understanding of the data for the analysis you’re embarking on. In another blog post, this prepared schools dataset is used to enrich a boundary feature layer for use in a territory design analysis. You can read about the analysis in the Learn ArcGIS lesson Balance Territories for College Recruiters.