
Concepts | WebTodoNov

2007-11-22 19:31:45

Google is a massive engine for organizing the world's information. Through hundreds of dedicated algorithms, tools, specialized sources, and systems, it provides a vast data resource. A great deal of information is available on the web through web pages, databases, and the meta-information carried by the linking structure, which is a key aspect of Google's ability to sort and classify. A mostly untapped extension of this built-in capability exists.

In the sense of a lookup table, the web exists as an endpoint for search queries. An example is the trigonometric table in a textbook, with precomputed values for sines, cosines, and tangents. In the search sense, a query for a given value would return a result from the tabular data resource. Especially relevant and useful results may be located if individual table rows are published independently, each on a separate page under its own unique URL.
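As a rough illustration of the one-row-per-URL idea, the sketch below precomputes a small trigonometric table and assigns each row its own page. The `/trig/<function>/<degrees>` path scheme, the function names, and the 15-degree step are assumptions made up for this example, not part of any existing site.

```python
import math

def row_path(function_name: str, degrees: int) -> str:
    """Build a unique URL path for one table row (hypothetical scheme)."""
    return f"/trig/{function_name}/{degrees}"

def build_pages(step: int = 15) -> dict[str, str]:
    """Precompute one short page of text per (function, angle) row."""
    pages = {}
    functions = (("sin", math.sin), ("cos", math.cos), ("tan", math.tan))
    for degrees in range(0, 90, step):  # stops before 90, where tan is undefined
        radians = math.radians(degrees)
        for name, fn in functions:
            value = fn(radians)
            # The page text repeats the lookup terms so a search can match them.
            pages[row_path(name, degrees)] = f"{name} {degrees} degrees = {value:.4f}"
    return pages

pages = build_pages()
print(pages["/trig/sin/30"])  # sin 30 degrees = 0.5000
```

Each entry in `pages` stands for a separate document that a crawler could index, so a query such as "sin 30 degrees" has exactly one row to land on.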

Another simple example is the online dictionary. A word is defined on a unique page, and a search for the word returns the definition, sometimes including context examples, usage, and historical information. Extending this format, precomputed mathematical transformations (inputs and outputs of linear and non-linear functions) and word contexts, for words in the abstract information-theory sense, may be returned for Google searches restricted to a specified domain, given a properly optimized site hierarchy. Statistical observations computed over windows of time-series data, for example, may be organized so that the data returned in the top ten Google results can be compiled into a further useful result. This combines information that is computationally tractable for the originating web server with Google's capability, to arrive at results that would be intractable within that server's limitations alone.
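One way to picture the time-series case is sketched below: each sliding window is summarized by a few statistics and rendered as a line of indexable text led by a pivot term. The pivot format (`name-w<size>-t<start>`), the choice of statistics, and the sample values are all assumptions for illustration only.

```python
from statistics import mean, pstdev

def window_summaries(series, size):
    """Yield (start_index, summary) for each full sliding window."""
    for start in range(len(series) - size + 1):
        window = series[start:start + size]
        yield start, {
            "mean": round(mean(window), 2),
            "stdev": round(pstdev(window), 2),  # population standard deviation
            "min": min(window),
            "max": max(window),
        }

def page_text(name, start, size, summary):
    """Render one window as searchable text; the pivot format is hypothetical."""
    pivot = f"{name}-w{size}-t{start}"
    stats = " ".join(f"{key}={value}" for key, value in summary.items())
    return f"{pivot} {stats}"

prices = [10.0, 10.5, 11.0, 10.0, 9.5]  # made-up sample values
pages = [page_text("demo", start, 3, summary)
         for start, summary in window_summaries(prices, 3)]
for page in pages:
    print(page)
```

A client could then search for the pivot terms of the windows it cares about and combine the statistics it gets back, rather than asking the originating server to compute over the whole series.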

The results most likely to be reached efficiently through this technique will be combinatorially difficult and highly parallelized, but will share canonical data points: the pivot search terms used to locate them. What computations are these? Characteristics of these sets might include:

- very high cardinality of the dataset
- complex, cyclic, or randomly distributed values
- partially heterogeneous outputs
- discrete, deterministic functions, or results meaningfully centered onto overlapping binary strings (1.999 is numerically very close to 2.002, but the two have poor ASCII text correlation; rounding both to 2.0 groups them very well)
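The rounding remark can be made concrete with a small sketch. It compares the raw text forms of 1.999 and 2.002, then maps both to a shared canonical key. The helper names and the choice of one decimal place are assumptions for this example.

```python
import os
from collections import defaultdict

def canonical_key(value: float, decimals: int = 1) -> str:
    """Round a value to a fixed number of decimals and return it as text."""
    return f"{value:.{decimals}f}"

def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading characters two strings have in common."""
    return len(os.path.commonprefix([a, b]))

def group_by_key(values, decimals: int = 1):
    """Group numeric values under their rounded text key."""
    groups = defaultdict(list)
    for value in values:
        groups[canonical_key(value, decimals)].append(value)
    return dict(groups)

a, b = 1.999, 2.002
print(shared_prefix_len(str(a), str(b)))    # 0: "1.999" and "2.002" differ at the first character
print(canonical_key(a), canonical_key(b))   # 2.0 2.0
print(group_by_key([1.999, 2.002, 2.26, 3.5]))
```

The canonical key is what would serve as the pivot search term: nearby values become textually identical, so a text index can group them even though their raw digits do not overlap.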

Datasets with these characteristics may include:

- biochemical or genomic strings
- weather data
- geolocated datapoints for correlation
- time-series correlations over financial instrument prices, news article text, or biometric monitor recordings
- factorization of large numbers
- hashmap-encoded data (a noisy, random string that references stored plaintext data)
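The last item, a random-looking string that references stored plaintext, might look roughly like the sketch below. It assumes a SHA-256 digest truncated to 16 hex characters as the reference token and an in-memory dictionary standing in for published pages; neither choice comes from the original note.

```python
import hashlib

def token_for(text: str, length: int = 16) -> str:
    """Derive a deterministic, random-looking token from plaintext."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]

store = {}  # stands in for pages published under their tokens

def publish(text: str) -> str:
    """Store plaintext under its token and return the token."""
    token = token_for(text)
    store[token] = text
    return token

token = publish("sample plaintext record")  # made-up record
print(token, "->", store[token])
```

Because the token is unlikely to appear anywhere else on the web, a search for it should return only the page holding the plaintext, which is what makes it usable as a lookup key.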