Welcome to Week 4 of ArcGIS Hub's Civic Analytics Notebook series. If this is your first encounter with the series, you can start with our introductory post and explore our offerings from the previous weeks. In last week's notebooks, we saw how to quantify relationships and patterns in civic data. We used Python's scipy library to numerically check whether attributes in our data are correlated, i.e., how much changes in one variable correspond with changes in another. In the second analysis notebook, we looked at two different techniques for identifying clusters in our data using the ArcGIS API for Python and the scikit-learn library.
This week we look at text data from surveys and feedback forms, specifically comments or additional information provided by respondents. The goal of this notebook is to demonstrate five text analysis techniques in Python for understanding the common themes in a large number of public comments on a city or regional project, without having to read through every response. Intrigued? Read on to learn what we found.
What can I quickly learn about responses to my survey?
In this notebook we work with the Vision Zero street safety survey from Washington, DC. The data is collected through a web application where the public can select a particular street segment and convey their concerns about its safety. We read this data in using the ArcGIS API for Python. We then use four Python libraries that assist with text processing and analysis: WordCloud, nltk, textblob, and spacy. Using WordCloud, we first create a word cloud of the most popular words in the survey. We then import nltk to identify the words of high frequency and high relevance to the survey and regenerate the word cloud. This gives us a quick visual snapshot of what people are talking about the most. Having done that, we extract the most popular words mentioned, which suggest the topics of importance in the survey.
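The frequency-counting step behind the word cloud can be sketched with the standard library alone. The notebook itself uses nltk's English stopword list and the WordCloud library to render the image; the comments and the tiny stopword set below are illustrative assumptions, not the survey's real data.

```python
import re
from collections import Counter

# Hypothetical stand-in for nltk's much larger English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "on", "at", "and", "to", "of", "it", "this"}

def top_words(comments, n=5):
    """Tokenize comments, drop stopwords, and return the n most common words."""
    words = []
    for comment in comments:
        words += [w for w in re.findall(r"[a-z']+", comment.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(n)

# Made-up example comments in the spirit of a street safety survey.
comments = [
    "Cars speed on this street and the crosswalk is faded",
    "Drivers speed through the intersection near the school",
    "The crosswalk needs a signal, drivers ignore pedestrians",
]
print(top_words(comments, 3))
```

The word-to-count pairs this produces are exactly what `WordCloud.generate_from_frequencies` expects, so the same dictionary can drive the visualization.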
We proceed to calculate a sentiment score for each comment, ranging from -1 (negative sentiment) to +1 (positive sentiment), using textblob, and visualize the distribution of the scores. This gives us a general sense of how citizens feel about the safety of their streets. We also extract the top 10 positive and negative comments, based on their sentiment scores, to get a sense of the comments with strong opinions. We conclude with a final technique that uses spacy to identify the named entities (proper nouns) mentioned in these comments and classify them as names of people, places, organizations, etc. This is a useful technique for quickly extracting the subject and focus of a comment.
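To make the -1 to +1 scale concrete, here is a minimal, stdlib-only sketch of lexicon-based polarity scoring, the idea underlying textblob's score. The tiny lexicon is a hypothetical stand-in, not textblob's actual data; in the notebook, `TextBlob(comment).sentiment.polarity` returns the score directly.

```python
import re

# Hypothetical word-polarity lexicon; real lexicons hold thousands of entries.
POLARITY = {"great": 0.8, "safe": 0.5, "good": 0.6,
            "dangerous": -0.9, "unsafe": -0.7, "bad": -0.6}

def sentiment(comment):
    """Average the polarity of known words; return 0.0 if none are found."""
    words = re.findall(r"[a-z']+", comment.lower())
    scores = [POLARITY[w] for w in words if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment("This intersection is dangerous and unsafe for kids"))  # negative
print(sentiment("The new crossing signal is great"))                    # positive
```

Scoring every comment this way yields a list of numbers whose histogram is the sentiment distribution described above, and sorting comments by score surfaces the strongest opinions at each end.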
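The named-entity step can likewise be sketched without spacy's trained models. This toy version only flags capitalized tokens against a hypothetical gazetteer of labels; spacy's statistical models recognize and classify far more than this, but the output shape (entity text plus a label such as PERSON, GPE, or ORG) is the same idea.

```python
import re

# Hypothetical gazetteer mapping known names to entity labels
# (spacy derives these labels from a trained model instead).
GAZETTEER = {"Washington": "GPE", "DDOT": "ORG", "Anacostia": "GPE"}

def toy_entities(comment):
    """Return (token, label) pairs for capitalized tokens found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z']+", comment)
    return [(t, GAZETTEER[t]) for t in tokens if t[0].isupper() and t in GAZETTEER]

print(toy_entities("DDOT should fix the crossing near the Anacostia bridge"))
```

With spacy the equivalent call is `[(ent.text, ent.label_) for ent in nlp(comment).ents]` after loading a model such as `en_core_web_sm`.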
While these techniques can be applied to any kind of survey to quickly summarize and decipher a large volume of human-entered text, they have other uses too. They transform text as we understand it (sentences) into a format that is more machine friendly. The results can then be structured in a way suitable for classification and prediction problems that use machine learning and deep learning algorithms. We will take a look at a few examples in the coming weeks.
Explore these tools with responses from surveys or public feedback forms made available by your local Hub. Using ArcGIS Hub and Survey123, you can use survey forms to gather public feedback, crowdsource data, and engage the community in local initiatives by giving residents a way to voice their thoughts. Download and add this notebook to your ArcGIS Online organization to work with your own survey data. Also, feel free to share your thoughts and results from your text data explorations with us on our GeoNet discussion thread. I look forward to connecting with you about your findings from exploring text data with Python.
Link to notebook – Exploratory text analysis of comments from surveys
Click here to learn more about Week 5 of Civic Analytics with ArcGIS Hub