What good is a ZIP Code centroid if I need to know if an address is inside the city.
Search for 13145 Alpine Ridge RD in Cheyenne, Wyoming, 82009, in any on-line mapping site, like Google or Bing Maps. They will not locate it properly on a map. Google’s Geocoding API returns an accuracy level of ”geometric center.” Bing Maps returns a “PostCode1.” Pitney Bowes returns a “Z1,” which is short hand for a ZIP Code centroid. In other words, the exact location on the map of the address has not yet been digitally captured. Therefore, it is not yet possible to plot it – at least not by a machine. At best all they can do is make a rough guess as to its location, like “somewhere near Cheyenne.”
Now suppose we need to know if the address is inside or outside the city limits of Cheyenne. Suppose we need to know which (telecom) rate or wire center contains it. Perhaps we need to know if it is inside a special taxing district, or inside a particular administrative zone. If we knew the map coordinates for the address, we would do a simple point-in-polygon lookup to assign the location to a boundary. Plot the latitude and longitude on a map, and see which city or rate center, for example, the boundary falls inside. Simple. Except in this case, any latitude and longitude that we might obtain for the address would be wildly inaccurate, and therefore the assignment possibly would be way off.
If your problem is limited to knowing whether the address is within city limits, you cannot use the “city” name in the last line of the address. That’s because the city name in this context really is the name of Post Office that handles your mail. It has no hard relationship to the physical location of the address.
All is not lost though. Since we know the ZIP Code, 82009 in this case, then we can use it to carry out a boundary assignment operation. A simple and defensible programatic approach is to check to see if the ZIP Code boundary of the address is wholly enclosed by a single boundary, a Place boundary for example. If it is, then we can assign the address to the boundary with a reasonable or even high degree of certainty. People often call this approach a “fall back check.”
In SQL logic, you might express the problem as “where Place geography contains ZIP Code geography.” The SQL Server 2012 code for that kind of operation would be like so:
SELECT p.[Feature ID] ,p.[Place Name], p.[ST] ,z.[ZIP Code] ,z.[Post Office Name] FROM FederalData.PlaceGeography p CROSS JOIN Project.ZIPCodeGeography z WHERE p.GeogBoundary.STContains(z.GeogBoundary) = 1
Of course, in reality, the problem is not so black and white for a variety of reasons. To begin with, ZIP Code boundaries and the target boundaries (e.g., Place or city boundaries) often come from different vendors and are not drawn to the same level of resolution. Differences also can occur even when boundaries are snapped to a common reference line. For example, highways sometimes represent a boundary line, yet variations in the representation of the center lines of the highway can lead to slight variations in boundary lines. Further, some boundary lines simply can be somewhat arbitrary when there is no legal reason for them to be precisely defined. In unpopulated or very lightly populated areas, precise ZIP Codes lines are not needed or even relevant. Who cares whether the ZIP Code boundary line splits a corn field. So different vendors might draw the same boundary lines differently.
Add to these nuisances the simple fact that ZIP Codes are specific to the problem of mail delivery. They do not conform to, nor do they need to conform to any other types of boundaries. About the only generalization you can make is that ZIP Codes do not cross State lines. However, even that is not perfect because a handful of ZIP Codes do in fact cross state lines.
All these sources of variations mean that ZIP Codes and target boundaries are subject to some level of either partial or incidental overlap. Instead of a simple yes/no question, expressed as, “does the Place wholly contain the ZIP Code,” the question would be better framed as, “does the Place boundary mostly contain the ZIP Code.”
For the area around Cheyenne, Wyoming, the following image shows what the ZIP Code boundaries look like. The ZIP Codes boundaries are in shades of red. Quickly you can see that there is no relationship between ZIP Code and Places. More to the point, no ZIP Code is wholly inside a Place. ZIP Code 82001 covers part of Cheyenne but also extends far to the west of town. ZIP Code 82009 covers a bit of Cheyenne, and a whole lot of other areas.
For the area around Burlington, Vermont, both ZIP Code 05405 and 05408 are mostly in Burlington, and 05404 is mostly inside Winooski. Look carefully, and you can see that the Place and ZIP Code boundary lines do not agree. That’s true for two reasons. First, Burlington is on Lake Champlain, which means its boundary line follows a jagged coastline. How it is drawn is a matter of how precise the vendor wants the coast line edge to be. Second, both the Place and ZIP Code lines are highly generalized at the shoreline – very highly generalized in fact for ZIP Code 05401, extending far out into the lake. Of course, there is no mail delivery in the middle of the lake. Still, there was no analytical reason for the vendor to snap the ZIP Code boundary to a coast line.
Here is the aerial of Burlington. As you can see, ZIP Code 05401 extends quite a bit into the lake, out to a representation of the New York – Vermont border. While we do not show it here, even the representation of the Place boundary extends into the lake (hence the straight line on the southern western edge.) These sorts of generalized, or overly generalized boundaries as in this example, complicate the problem a bit but do not invalidate the approach.
Technically, its easy to figure out how much two boundaries overlap. Programmatically, it works like so:
SELECT p.[Feature ID] ,p.[Place Name], p.[ST] ,z.[ZIP Code] ,z.[Post Office Name], z.GeogBoundary.STIntersection(p.GeogBoundary).STArea() / z.GeogBoundary.STArea() AS ProportionOverlap FROM FederalData.PlaceGeography p CROSS JOIN Project.ZIPCodeGeography z WHERE p.GeogBoundary.STContains(z.GeogBoundary) = 1
The real issue, though, is what defines “mostly?” Is it 99 percent within? What about 95 or 90 percent? The answer to this question is a business decision. The level of tolerance for errors you can accept, both from the degree of overlap as well as from sources of error that result from different levels of precision in the boundaries, is specific to your business needs. If the financial cost of an incorrect answer is substantial, then it could be that no fall back check ever would be considered safe. If the cost is minor, or if the rule of “due diligence” applies, then you might be able get away with decisions based on a finding of “mostly contains” as defined by some percentage. Again though, this is a business decision, not a technical decision.
Assuming that some tolerance for less than perfection is acceptable, it is advisable to pre-calculate the amount of overlap between ZIP Codes and the target boundaries. After all, the usual subscription to ZIP Code and other boundary map products normally is for quarterly updates. So you could make the calculations once (a quarter) and be done with it. The code for precomputing the overlaps between ZIP Code and Places might look like so:
;with OverlapCandidates as ( select p.[Feature ID] ,p.[Place Name], p.[ST] ,z.[ZIP Code] ,z.[Post Office Name] ,case when z.[GeogType] not like '%Point%' then case when p.GeogBoundary.STContains(z.GeogBoundary) = 1 then 'ZIP Code inside Place' when z.GeogBoundary.STContains(p.GeogBoundary) = 1 then 'Place inside ZIP Code' when z.GeogBoundary.STIntersects(p.GeogBoundary) = 1 then 'Partial Overlap' else 'Mistake' end else 'n/a (Post Office)' end as [Overlap Type] ,round(z.GeogBoundary.STIntersection(p.GeogBoundary).STArea(),0) as [Overlap Area] ,round(z.GeogBoundary.STArea(),0) as [ZIPCode Area] ,round(p.GeogBoundary.STArea(),0) as [Place Area] from FederalData.PlaceGeography p CROSS JOIN MapInfo.ZIPCodeGeography z where p.GeogBoundary.STIntersects(z.GeogBoundary) = 1 ) ,RankCandidates as ( select [Feature ID], [Place Name], [Post Office Name], [ST], [ZIP Code], [Overlap Type], row_number() Over(Partition By [ZIP Code] Order By [Overlap Area] desc) as [Overlap Ranking], round(case when [Overlap Area] > 0 then [Overlap Area] / [ZIPCode Area] else 0 end,4) as [Proportion Overlap Area To ZIPCode Area], [Overlap Area], [ZIPCode Area], [Place Area] from OverlapCandidates oc ) select [Feature ID], [ZIP Code], [Place Name], [Post Office Name], [ST], [Overlap Type], --Business Rule case when [Proportion Overlap Area To ZIPCode Area] between 0.0009 and 0.02 then 'Incidental Overlap' when [Proportion Overlap Area To ZIPCode Area] between 0.9499 and 0.9999 then 'Dominant Place' else [Overlap Type] end as [Overlap Type, Adjusted], [Overlap Ranking], [Proportion Overlap Area To ZIPCode Area], [Overlap Area], [ZIPCode Area], [Place Area] from RankCandidates order by [ZIP Code], [Overlap Ranking];
The output of this code looks like so.
A business rule for what constitutes a meaningful degree of overlap would be applied using the column, “Proportion Overlap Area to ZIP Code.” Looking at ZIP Code 05405, 99 percent of it falls inside Burlington; same for 05408. By business rule, Burlington is considered the dominant Place. Therefore, if our geocoder returned a Postal Code geocode type for an address and it’s ZIP Code was 05405 then we would conclude the address to be inside Burlington, even if we do not know exactly where inside.
Going back to our Cheyenne example, very little of ZIP Code 82009 (less than four percent) overlaps the city of Cheyenne. By business rule, that’s considered a “Partial Overlap” albeit a minor overlap. So in this case, we could not conclude that the address is in the city, based on the ZIP Code alone.
Moving back to Burlington, 42 percent or so of ZIP Code 05401 falls inside Burlington. We know from the map images shown above that the actual coverage is better than that. However, from the numeric data available we cannot make that determination. For this post, we deliberately used a low quality version of ZIP Code boundaries. It helps to make the point that the cost of an incorrect decision plays into this process. If its high, then it could be worth the investment in a higher quality (and more expensive) ZIP Code map product (or spend the labor money to do a visual audit of the data).
So far, we have been discussing the two extremes of the problem. One one side, the address geocodes accurately to its location; on the other side, to a ZIP Code centroid. Suppose the geocode is a middle case, of where the address geocodes to a ZIP+4 centroid. Depending on your level of tolerance for incorrect boundary assignments, you might consider ZIP+4 centroids good enough for a point-in-polygon test. You just have to keep in mind that ZIP+4s represent a collection of addresses (e.g. homes and businesses) along a section of a street. In urban and suburban area, the street section usually is short. Some rural ones however can extend quite some distance (even miles). We have seen cases in which the ZIP+4 crosses county lines, and even a case recently in which the ZIP+4 stretched across a state line.
In all this discussion, we are putting aside the technicality that addresses represent land parcels and not infinitely tiny points of latitude and longitude on the earth. If ￼￼￼￼￼￼￼￼you need to know if the land parcel identified by the address is inside a boundary, then that’s a different problem for another day.
Also, an important caveat to the boundary fall back approach is that you need to be able to assume that the ZIP Code you received is correct. Given enough data in the input, some geocoders will verify and correct the ZIP Code, even if it does not know where to plot the address on a map. That’s because ZIP Codes are assigned through textual matches against an address dictionary instead of a spatial operation. Once again, it is a matter of tolerance for incorrect decisions.
My general experience is that in ten to twenty percent of geocoding queries, the map coordinates you get back are at something less than the address level, with ZIP Code being the most common fallback. Even in these days of hi-tech geocoding software, the operations are still subject to the quality variations in the input: user input errors, directionals omitted, incorrect city provided, out of date street name used, odd abbreviations, etc. Garbage input – noise we sometimes call it politely – easily can trip up a geocoder. Less cynically, perhaps the address is new construction and not yet known to the geocoding engine. Regardless, when geocoding fails to produce a roof top hit, you still have programatic options to solving boundary assignment problems.