Finding Addresses in Webpages

Regex-trie solution described here: https://stackoverflow.com/a/42789508/6254147

Larger Corpora

When parsing a large number of documents to find addresses, edge cases are likely to come up, but handling the even most common variations quickly degrades the performance of regular expression matching. Searching structured address files or free text content that is known to contain one or more postal addresses can be optimized.

For my purpose, and for anyone searching arbitrary documents for addresses, a much quicker filter can reduce the dataset to make extracting whole addresses more tractable. Doing this by matching only the two-letter state code and valid 5 digit zipcodes with a regular expression reduces the set of pages to search for complete addressses substantially. The regex generated by the described proccess is included at the end of this document.

A filter to select pages containing street addresses or reject pages that have no valid address should be performant, and for a regular expression this means several things:

be specific
don't be (too) greedy
be concise
fail early

Since fast text matching across a large document corpus is not an uncommon task in programming, as with many others, smart people have developed optimal strategies that can be adapted for the task at hand.

My task (identifying whether a document contains a US street address) then reduces to:

implement a preexisting performant set matching algorithm or source code
create my match set
iterate through the documents in the corpus to filter matches for the more intensive street address extraction task

Finding efficiencies

Not every address in the US is simple, but there is a lot less complexity in two components, the two character alphabetic state code and the five character numeric zip code, and it helps to further reduce complexity to incorporate the fact that they are related.

From the naive regular expression above, one way to match this pair of address components might be to match two capital letters and then five numbers:

/[A-Z]{2} [0-9]{5}/

But of course, not every pair of letters is a valid state or territory (there are 26 * 26 = 676 possible pairs, but fewer than 100 codes the US Postal Service will recognize as a state or territory ) and not every 5-digit number is a valid zipcode (out of 99999 + 1 (for '00000') possibilities, there are around 50,000 zipcodes).

Match on a (long|list|of|alternates)?

What is more, not every state matches every zip code ("New York, NY 90210" is not valid). So we might imagine a regular expression that compares fized strings for known pairs to the text. That would lead to large string (~1Mb) list of every valid state/zipcode pair.

Constructing special regexes to match known state/zipcode pairs

To do this manually would take some tedious work and be failure prone, depending on the base data, and brittle, as zipcodes can and do change: since the range of zip codes that match New York State is between '10001' and '14975' (with exceptions for an apparently very fancy part of New York called Fisher's Island, which is closer to Connecticut and has the zipcode 06390, and Holtsville, NY having the zipcodes 00501 and 00544, so matching New York addresses can skip many comparisons with a regex like

/NY [0-1][0-9][0-9][0-9][0-9]/

discarding any match to "NY " for any digit larger than the leading number "1".

However, since only three (that I know of) zipcodes in New York state begin with "0", we can move these exceptional cases out of the primary comparison and save processing time, using something like

/NY 1[0-4][0-9][0-9][0-9]/

If searching through petabytes of data, this may save you calendar time compared to implementing, debugging, and maintaining the more complex match. With datasets that can fit into memory on a single machine, it may not make sense to optimize much at all.

Enter the data structures solution

A data structure that allows us to 'fail early' and not make many uneccesary comparisons is a trie (pronounced ambiguously with the tree data structure, as in "retrieval"). With a trie, as soon a character is encountered to form a string not in the entire list, it will exit.

After beginning to match New York State, "N...Y..." the regex continues on to look for a number that will match the first digit of a zipcode in the state of New York. (A non-state string "N...X." drops the failed match for any state and moves on).

Whether this level of optimization makes sense to implement depends on your use case, and creating it from scratch could cost more time than simply running an exhaustive match list. Fortunately, implementing a trie to regex generator has been done already, and only requires the addition of zip code lists, I can take the time to write this out and not spend this time implementing it on my own.

Finding addresses (US)

Typical formats for addresses may at first appear easy to parse, but real-world addresses are notoriously more difficult

Stereotypical:

    Named Entity
    123 Example St.
    Townville, PA 11111

Actual and typical:

    Salt Lake City Chamber of Commerce
    201 South Main Street #2300
    Salt Lake City, UT 84111

Regular expressions are half the answer

They say: "when you try to solve a problem with regular expressions: 'Now you have two problems'." I attempt to manage this truism back down to a single problem, by only using regexps to get halfway through.

A very naive pseudocode regex to match the simple example might look like:

/(?<name>[A-Z][a-z]+)
(?<street_number>[0-9]+) (?<street_name>[A-Z][a-z]+) (?<street_type>[A-Z][a-z]\.)
(?<city>\w+), (?<state_abbreviation>[A-Z]{2}) (?<zip_code>[0-9]{5})/

I cannot and will not go into any of many the ways this regex is suboptimal and/or wrong, or iterative improvements on it. Just to say the simple example is going to fail to find any of the very many addresses in "Los Angeles" or "New York City" (with multiple words in the city name), anyone living in an apartment (that would likely, but not always follow after the street type), having a hyphenated name (non-word character), and so many other cases.

A regular expression that is robust to these and other variations becomes much longer, much slower to run, may be actually impossible, and is not the focus of this article or its conclusion, so look elsewhere if you really need to use regex to match an address (and you probably don't, really).

About:

Catchall page describing postal addresses for most if not all world countries.

Sources:

http://www.columbia.edu/~fdc/postal/

https://news.ycombinator.com/item?id=17577127

http://iregex.org/blog/trie-in-python.html

[https://techbio.org/wiki/Addresses/zipcode-trie-regex](Regex for two-letter state and zipcode pairs)

Finding Zipcodes in Webpages | Addresses