dotbot | DotNetDotCom.org
I find it cool that 7% of the web is not there.
We are just a few Seattle-based guys trying to figure out how to make internet data as open as possible. You should be able to find everything you are looking for below. If not, feel free to contact us. Happy Surfing!
Whole Internet in one file :)
Whoosh
Whoosh: a fast pure-Python search engine
Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Some of Whoosh's features include:
* Pythonic API.
* Pure-Python. No compilation or binary packages needed, no mysterious crashes.
* Fielded indexing and search.
* Fast indexing and retrieval -- much faster than any other pure-Python solution.
* Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
* Powerful query language parsed by pyparsing.
* Pure Python spell-checker (as far as I know, the only one).
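To make "fielded indexing and search" concrete, here is a minimal pure-Python sketch of an inverted index keyed by field and term. It is not Whoosh's API or implementation; the class `TinyFieldedIndex` and its methods are hypothetical names invented for illustration, and it does no scoring, stemming, or persistence.

```python
from collections import defaultdict


class TinyFieldedIndex:
    """Toy in-memory inverted index keyed by (field, term). Not Whoosh's API."""

    def __init__(self):
        # (field name, term) -> set of ids of documents with that term in that field
        self.postings = defaultdict(set)

    def add_document(self, doc_id, **fields):
        # Tokenize naively on whitespace; a real analyzer does far more.
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose `field` contains every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result


index = TinyFieldedIndex()
index.add_document(1, title="Whoosh search", body="pure python full text search")
index.add_document(2, title="Sphinx search", body="fast full text search server")

print(index.search("title", "whoosh"))
print(index.search("body", "full text"))
```

Keeping the field name in the posting key is what lets a query for "python" match the body of a document without matching its title.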
Whoosh was created and is maintained by Matt Chaput. It was originally created for use in Side Effects Software's 3D animation software Houdini. Side Effects Software Inc. graciously agreed to open-source the code.
A List Apart: Articles: Indexing the Web—It’s Not Just Google’s Business
A basic one about optimizing database query execution time.
Indexing the Web
Official Google Webmaster Central Blog: Flash indexing with external resource loading
A team member thought we should add an index on a 90-million-row table to improve performance. The field on which he wanted to create this index had only four possible values. I replied that an index on a low-cardinality field wasn't really going to help anything. My boss then asked me why it wouldn't help. I sputtered around for a response but ended up telling him that I'd get back to him with a reasonable explanation.
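The arithmetic behind that answer can be sketched in a few lines. This is a rough illustration only: it assumes the four values are evenly distributed, and the function name `avg_rows_per_value` is made up for the example. Whether a given database actually uses or ignores such an index depends on its optimizer and on the real data distribution.

```python
def avg_rows_per_value(distinct_values, total_rows):
    """Average number of rows matched by one equality lookup,
    assuming values are evenly distributed (an assumption, not a given)."""
    return total_rows / distinct_values


TOTAL_ROWS = 90_000_000   # table size from the anecdote
DISTINCT_VALUES = 4       # possible values in the candidate column

rows_per_lookup = avg_rows_per_value(DISTINCT_VALUES, TOTAL_ROWS)
fraction = rows_per_lookup / TOTAL_ROWS

print(f"rows matched per lookup: {rows_per_lookup:,.0f}")
print(f"fraction of table touched: {fraction:.0%}")
```

Under these assumptions each lookup still has to fetch a quarter of the table, so following index entries to 22.5 million rows is unlikely to beat simply scanning the table.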
Imported from http://twitter.com/newsycombinator/status/2645303258
How b-tree database indexes work and how to tell if they are efficient http://bit.ly/dd6mf
Official Google Webmaster Central Blog: Optimize your crawling & indexing
This week at Google I/O, Google talked a lot about the evolution of the technological capabilities of the web. HTML 5 is ushering in a new era of browser-based
Great, comprehensive article
Talking about AJAX and Flash searchability
Google is working on it
The Google Webmaster Central team has been providing a wealth of education around these issues to help developers build search-friendly web sites. For instance:
* Search-friendly AJAX
* Canonicalization
* Site moves
At Maile’s Search-Friendly Development session at Google I/O, Google announced two advances in their ability to crawl and index RIAs. While both of these advances are great efforts, they were driven entirely by the search team. And unfortunately, they don’t solve the issues with the Google Code APIs. Wouldn’t it be great if the new Web Elements they just announced were search-engine friendly by default?
Sphinx - text search The Pirate Bay way • The Register
and it's on track to become the open source world's canonical answer to the question of text search. MySQL and Solr, the two popular solutions, are showing their age. MySQL introduced full-text search in late 2000 as a way to more intelligently search blobs of text stored in databases. You can work a full-text clause into a query, and MySQL will rank the result rows by how relevant it thinks they are to the query. MySQL uses textbook search algorithms and doesn't allow for a lot of relevance tuning. It's like a drawing from a five year old: The heart is in the right place, but everybody knows that kids suck at drawing. Implementation details aside, MySQL still suffers from scalability problems. Having ignored the trend of chip manufacturers to build multiple cores into CPUs, hoping that this unpleasant trend that required them to actually think about multi-threading would just blow over sooner or later, MySQL's ability to handle parallelism is, well, see the five year old's drawing.
Sphinx can index 10 megabytes of data per second and can search up to 100 gigabytes of text on a single processor. It also supports multi-machine distributed searching, as in the case of Craigslist.
A fast, fuzzy, full-text index using Redis | PlayNice.ly
PlayNice.ly is entirely based on a data-structure server called Redis. Redis is one of several new key-value databases which break away from traditional relational data architecture. It is simple, flexible, and blazingly fast. So why not use the tools we have already?
redis.smembers("word:" + metaphone("python"))
Interesting post about being able to search data in Redis using indexing and phonetic algorithms.
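The idea in that one-liner can be tried without a Redis server. The sketch below is an approximation, not the PlayNice.ly implementation: a plain dict of sets stands in for Redis sets (`.add` where Redis would use SADD, a plain read where it would use SMEMBERS), and a simplified Soundex replaces metaphone because Soundex fits in a few lines of standard-library Python. The helpers `soundex`, `index_document`, and `lookup` are hypothetical names.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels and other letters have no code.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Simplified Soundex: first letter plus up to three digits, zero-padded."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = []
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code
    return (word[0].upper() + "".join(encoded) + "000")[:4]


store = defaultdict(set)  # stands in for Redis: key -> set of document ids


def index_document(doc_id, text):
    for word in text.split():
        key = soundex(word)
        if key:
            store["word:" + key].add(doc_id)  # Redis equivalent: SADD


def lookup(word):
    # Redis equivalent: SMEMBERS on the phonetic key
    return set(store.get("word:" + soundex(word), set()))


index_document("doc1", "Learning Python quickly")
index_document("doc2", "Redis sets are fast")

print(lookup("pithon"))  # misspelling shares a phonetic key with "Python"
```

Because the set key is the phonetic code rather than the literal word, a misspelled query lands on the same set as the correctly spelled term, which is where the "fuzzy" matching comes from.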