2022-05-02 12:00:00
Regex-trie solution described here: https://stackoverflow.com/a/42789508/6254147
When parsing a large number of documents to find addresses, edge cases are likely to come up, but handling the even most common variations quickly degrades the performance of regular expression matching. Searching structured address files or free text content that is known to contain one or more postal addresses can be optimized.
For my purpose, and for anyone searching arbitrary documents for addresses, a much quicker filter can reduce the dataset to make extracting whole addresses more tractable. Doing this by matching only the two-letter state code and valid 5 digit zipcodes with a regular expression reduces the set of pages to search for complete addressses substantially. The regex generated by the described proccess is included at the end of this document.
A filter to select pages containing street addresses or reject pages that have no valid address should be performant, and for a regular expression this means several things:
Since fast text matching across a large document corpus is not an uncommon task in programming, as with many others, smart people have developed optimal strategies that can be adapted for the task at hand.
Not every address in the US is simple, but there is a lot less complexity in two components, the two character alphabetic state code and the five character numeric zip code, and it helps to further reduce complexity to incorporate the fact that they are related.
From the naive regular expression above, one way to match this pair of address components might be to match two capital letters and then five numbers:
/[A-Z]{2} [0-9]{5}/
But of course, not every pair of letters is a valid state or territory (there are 26 * 26 = 676 possible pairs, but fewer than 100 codes the US Postal Service will recognize as a state or territory ) and not every 5-digit number is a valid zipcode (out of 99999 + 1 (for '00000') possibilities, there are around 50,000 zipcodes).
What is more, not every state matches every zip code ("New York, NY 90210" is not valid). So we might imagine a regular expression that compares fized strings for known pairs to the text. That would lead to large string (~1Mb) list of every valid state/zipcode pair.
To do this manually would take some tedious work and be failure prone, depending on the base data, and brittle, as zipcodes can and do change: since the range of zip codes that match New York State is between '10001' and '14975' (with exceptions for an apparently very fancy part of New York called Fisher's Island, which is closer to Connecticut and has the zipcode 06390, and Holtsville, NY having the zipcodes 00501 and 00544, so matching New York addresses can skip many comparisons with a regex like
/NY [0-1][0-9][0-9][0-9][0-9]/
discarding any match to "NY
However, since only three (that I know of) zipcodes in New York state begin with "0", we can move these exceptional cases out of the primary comparison and save processing time, using something like
/NY 1[0-4][0-9][0-9][0-9]/
If searching through petabytes of data, this may save you calendar time compared to implementing, debugging, and maintaining the more complex match. With datasets that can fit into memory on a single machine, it may not make sense to optimize much at all.
A data structure that allows us to 'fail early' and not make many uneccesary comparisons is a trie (pronounced ambiguously with the tree data structure, as in "retrieval"). With a trie, as soon a character is encountered to form a string not in the entire list, it will exit.
After beginning to match New York State, "N...Y..." the regex continues on to look for a number that will match the first digit of a zipcode in the state of New York. (A non-state string "N...X." drops the failed match for any state and moves on).
Whether this level of optimization makes sense to implement depends on your use case, and creating it from scratch could cost more time than simply running an exhaustive match list. Fortunately, implementing a trie to regex generator has been done already, and only requires the addition of zip code lists, I can take the time to write this out and not spend this time implementing it on my own.
Typical formats for addresses may at first appear easy to parse, but real-world addresses are notoriously more difficult
Named Entity
123 Example St.
Townville, PA 11111
Salt Lake City Chamber of Commerce
201 South Main Street #2300
Salt Lake City, UT 84111
They say: "when you try to solve a problem with regular expressions: 'Now you have two problems'." I attempt to manage this truism back down to a single problem, by only using regexps to get halfway through.
A very naive pseudocode regex to match the simple example might look like:
/(?<name>[A-Z][a-z]+)
(?<street_number>[0-9]+) (?<street_name>[A-Z][a-z]+) (?<street_type>[A-Z][a-z]\.)
(?<city>\w+), (?<state_abbreviation>[A-Z]{2}) (?<zip_code>[0-9]{5})/
I cannot and will not go into any of many the ways this regex is suboptimal and/or wrong, or iterative improvements on it. Just to say the simple example is going to fail to find any of the very many addresses in "Los Angeles" or "New York City" (with multiple words in the city name), anyone living in an apartment (that would likely, but not always follow after the street type), having a hyphenated name (non-word character), and so many other cases.
A regular expression that is robust to these and other variations becomes much longer, much slower to run, may be actually impossible, and is not the focus of this article or its conclusion, so look elsewhere if you really need to use regex to match an address (and you probably don't, really).
Catchall page describing postal addresses for most if not all world countries.
http://www.columbia.edu/~fdc/postal/
https://news.ycombinator.com/item?id=17577127
http://iregex.org/blog/trie-in-python.html
[https://techbio.org/wiki/Addresses/zipcode-trie-regex](Regex for two-letter state and zipcode pairs)