Governments struggle to balance privacy and confidentiality concerns against the public trust that comes from open information. Academic researchers lean toward openness to facilitate repeatability, and thus verification, of their hypotheses. Businesses recognize the competitive value of information and tend to retain it within their organizations. In the wake of increased terrorism, these stereotypical stances have shifted significantly toward privacy. However, the quantity and reach of information continue to grow.

Determining which data to share on the Internet requires judgment on the part of a potential distributor. As GIS proliferates throughout government, academic, and corporate organizations, such decisions will also increasingly involve whether to distribute map layers. Understanding the effects of GIS transformations provides a necessary foundation for these decisions.

Address matching procedures in GIS software take a list of addresses and convert these into coordinate locations. The resulting maps visually display geographic patterns of phenomena such as disease and crime. If coordinate locations are subsequently released, the map creator takes a risk that others may reverse the matching procedure and discover original addresses. The coordinates are essentially an encoded address list. If the creator considers the encoding procedure sufficiently secure, he or she may be willing to distribute the coordinates of private or semi-private information. The marketing advantages of mailing lists create an incentive for businesses to decode relevant point layers using reverse address matching techniques. The result may be inadvertent loss of individual privacy. Address matching encodes, but does not encrypt, addresses. Geocoded points communicate location information; encryption prevents communication except among privileged parties.
Understanding Offset and Squeeze

When performing geocoding, an address list is transformed into a set of coordinate points based upon a street reference layer. If an appropriate street segment is found for an address, two parameters, offset and squeeze, help determine the coordinate locations. Street line segments typically represent an approximate street centerline. Offset is the distance to the left or right of the street segment at which a point is placed, so an offset of zero places all addresses along the middle of streets. To get a more realistic position for buildings, addresses are frequently offset a certain distance, such as 20 meters, from the street line segment. A squeeze value greater than zero pulls points in from the ends of line segments, which avoids placing the first and last addresses on each street in the middle of a cross street. Possible squeeze values range from 0 to 100 percent. A squeeze of 100 percent places all addresses on one side of any given block in exactly the same spot.

Reverse Address Matching

Untrained or inexperienced users may not readily notice the geometric and mathematical repeatability of address matching or consider the potential for reversing the process. In 1999, the Geographic Information and Society international conference included a paper, "Hacking: On the Use of Inverse Address-Matching to Discover Individual Identities from Point-Mapped Information Sources," by Marc P. Armstrong and Amy J. Ruggles. Although they acknowledged that variables used in the matching process could be discovered, they suggested that increasing offset and squeeze parameters can guard against deciphering addresses in ArcInfo.
Their findings necessarily rely on assumptions made by the programmer who wrote the reversal script. For example, the reverse address matching script in ArcView assumed a zero squeeze percentage even though the default for address matching in that program is five percent. Intuitively, an offset of zero will reduce decoding accuracy because equal ranges on both sides of a street create identical pairs of coordinates along the street segment. No strictly geometric analysis can determine the correct address for coincident points with greater than 50 percent probability.

The research by Armstrong and Ruggles prompted an analysis of address matching algorithms. Concern for the privacy implications of their results and interest in the process of discovering input variables led this author to consider alternative methods for reverse address matching. Specifically, she sought to take advantage of parameter information inherent in a set of points rather than considering each coordinate individually. Sets of 10 points each were selected randomly from the possible addresses on TIGER street segments in two suburban ZIP Code areas outside Buffalo, New York. One hundred sets were created for each combination of offset (0, 20, or 40 meters) and squeeze (0, 10, or 20 percent), for a total of 900 test sets.
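To make the roles of offset and squeeze concrete, here is a minimal Python sketch of how a geocoder might place one address along a straight street segment. It is an illustration under stated assumptions, not the algorithm used by ArcView or ArcInfo. The function name `place_address`, its parameters, the linear interpolation of house numbers, and the choice to split the squeeze equally between the two ends of the segment are all hypothetical.

```python
import math

def place_address(start, end, house_no, range_lo, range_hi,
                  offset_m, squeeze_pct, side=1):
    """Place a house number along a straight street segment (illustrative only).

    start, end         -- (x, y) segment endpoints in meters; must differ
    range_lo, range_hi -- address range for one side of the segment
    offset_m           -- distance from the street line to the point
    squeeze_pct        -- 0 to 100; assumed here to be split equally
                          between the two ends of the segment
    side               -- +1 for left of the start-to-end direction, -1 for right
    """
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)

    # Linear interpolation: 0 at range_lo, 1 at range_hi.
    if range_hi == range_lo:
        t = 0.5
    else:
        t = (house_no - range_lo) / (range_hi - range_lo)

    # Squeeze pulls points in from both ends; 100 percent collapses
    # every address on this side onto the segment midpoint.
    s = squeeze_pct / 100.0
    t = s / 2 + t * (1 - s)

    # Unit normal to the left of the direction of travel is (-dy, dx) / length.
    nx, ny = -dy / length, dx / length
    return (x0 + t * dx + side * offset_m * nx,
            y0 + t * dy + side * offset_m * ny)
```

Under these assumptions, on an east-west segment 100 meters long with an address range of 1 to 99, an offset of 20 meters and a squeeze of 10 percent put number 1 five meters in from the start of the segment and 20 meters to its left. Because the calculation is deterministic, anyone who knows the street layer and the two parameters can repeat it for every possible address.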
The results reinforced confidence in the ability of map hackers to determine offset and squeeze values from a small number of geocoded points. Offset was deduced perfectly in all 900 tests by calculating the most common distance between each test point and its nearest street segment. Although other methods are possible, the author's squeeze detection algorithm relied on the replicable nature of address matching and computationally intensive, or brute force, programming. Using a known offset, all possible addresses on every street were plotted at one percent increments. Avenue scripts found the most likely squeeze value by evaluating the sum of squared distances between test points and possible address points for the tested percentages. All 900 tests perfectly deduced the input squeeze value. (Note: Restricting squeeze to integers limits the ability to generalize this finding.)
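The two deductions described above can be sketched in Python. This is not the author's Avenue code, which is not reproduced in the article; it is a simplified illustration of the same idea. The names `place`, `dist_to_segment`, `deduce_offset`, and `deduce_squeeze` are hypothetical, as are the simplifying assumptions: straight street segments, addresses on one side of the street only, house numbers stepping by two, an equal split of squeeze between segment ends, and distances rounded to 0.1 meter before the most common value is chosen.

```python
import math
from collections import Counter

# A street is (start, end, low_number, high_number), one side only (assumption).

def place(street, house_no, offset_m, squeeze_pct):
    """Hypothetical forward geocoder: interpolate, squeeze, offset to the left."""
    (x0, y0), (x1, y1), lo, hi = street
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    t = 0.5 if hi == lo else (house_no - lo) / (hi - lo)
    s = squeeze_pct / 100.0
    t = s / 2 + t * (1 - s)  # assumed: squeeze split equally between ends
    return (x0 + t * dx - offset_m * dy / length,
            y0 + t * dy + offset_m * dx / length)

def dist_to_segment(p, street):
    """Shortest distance from point p to the street's line segment."""
    (px, py), (ax, ay), (bx, by) = p, street[0], street[1]
    dx, dy = bx - ax, by - ay
    seg2 = dx * dx + dy * dy
    u = 0.0 if seg2 == 0 else ((px - ax) * dx + (py - ay) * dy) / seg2
    u = max(0.0, min(1.0, u))  # clamp the projection to the segment
    return math.hypot(px - (ax + u * dx), py - (ay + u * dy))

def deduce_offset(points, streets, precision=1):
    """Most common distance between each point and its nearest segment."""
    dists = [round(min(dist_to_segment(p, st) for st in streets), precision)
             for p in points]
    return Counter(dists).most_common(1)[0][0]

def deduce_squeeze(points, streets, offset_m):
    """Brute force: try each whole percentage, keep the best least-squares fit."""
    def score(pct):
        # Plot every possible address at this squeeze value.
        candidates = [place(st, n, offset_m, pct)
                      for st in streets
                      for n in range(st[2], st[3] + 1, 2)]
        # Sum of squared distances from each point to its nearest candidate.
        return sum(min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in candidates)
                   for px, py in points)
    return min(range(101), key=score)

# Usage: three "released" points geocoded with offset 20 m and squeeze 10 percent.
streets = [((0.0, 0.0), (100.0, 0.0), 1, 99)]
released = [place(streets[0], n, 20.0, 10) for n in (1, 37, 99)]
offset = deduce_offset(released, streets)
squeeze = deduce_squeeze(released, streets, offset)
print(offset, squeeze)
```

In this toy case the sketch recovers the 20-meter offset and the 10 percent squeeze from three points. Like the tests reported above, it searches whole percentages only, so a non-integer squeeze would be approximated rather than recovered exactly, and the cost grows with the number of streets and candidate addresses.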