Tropo / Dave / Bookmarks : ir

Home - GitHub
    Project Goose is an article extractor written in Java, using Maven for its dependencies. It is an open source project born from Gravity Labs (http://gravity.com). Its goal is to take a webpage, perform calculations, and extract the main text of the article, as well as recommend which image on the page is likely the most relevant.
    https://github.com/jiminoc/goose/wiki
    tags: nlp ir

List of resources: Article text extraction from HTML documen...
    http://tomazkovacic.com/blog/56/list-of-resources-article-te...
    tags: ir

Mining of Massive Datasets
    http://www.scribd.com/doc/46052657/Untitled?secret_password=...
    tags: ir ml

Recommendations research research papers collection | Mendel...
    collaborative filtering
    http://www.mendeley.com/research-papers/collections/796791/R...
    tags: ir cf

N-gram data from Project Gutenberg | Prashanth Ellina
    http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-...
    tags: gutenberg ir ngram

Bollysite Blog » Blog Archive » Python coded GoogleMini SA...
    http://blog.bollysite.com/2010/02/08/python-coded-googlemini...
    tags: gae ir search sayt

Compendium of Lost Words
    Welcome to the Compendium of Lost Words, a component of The Phrontistery. The Compendium lists over 400 of the rarest modern English words - in fact, ones that have been entirely absent from the Internet, including all online dictionaries, until now. By revealing the existence of these words online, I do not necessarily promote their revival, but I do encourage an appreciation of the flexibility of English vocabulary. In theory, the Compendium will be the only web page on which each of these words occurs in its proper English context. Click on a link below to take you to the four main Compendium pages, organized alphabetically by word, or on the links for more information about the site.
    http://phrontistery.info/clw.html
    tags: ir

Open Text Summarizer
    Automatic text summarization is the technique in which a computer program summarizes a document. A text is put into the computer and a highlighted (summarized) text is returned. The Open Text Summarizer is an open source tool for summarizing texts. The program reads a text and decides which sentences are important and which are not. It ships with Ubuntu, Fedora and other Linux distros. OTS supports many (25+) languages, which are configured in XML files.
    http://libots.sourceforge.net/
    tags: ir summarization

IR Datasets
    http://boston.lti.cs.cmu.edu/callan/Data/#Web
    tags: data ir

Introduction to Information Retrieval
    This is the companion website for the following book. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
    http://www-csli.stanford.edu/~hinrich/information-retrieval-...
    tags: ir search clustering informationretrieval information-retrieval

LingPipe: Competition
    On this page, we break our competition down into academic toolkits and industrial toolkits. We only consider software that is available for linguistic processing, not companies that rely on linguistic processing in an application but do not sell that technology. How does LingPipe compare to the below offerings?
    http://alias-i.com/lingpipe/web/competition.html
    tags: ir nlp

Statistical NLP / corpus-based computational linguistics res...
    http://www-nlp.stanford.edu/links/statnlp.html
    tags: ir nlp

SCIgen - An Automatic CS Paper Generator
    SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence.
    http://pdos.csail.mit.edu/scigen/
    tags: markov ir language textgeneration

The Stanford NLP (Natural Language Processing) Group
    Named Entity Recognition (NER) and Information Extraction (IE)
    http://nlp.stanford.edu/ner/index.shtml
    tags: nlp ir ie namedentity

About TextTiling
    http://people.ischool.berkeley.edu/~hearst/tiling-about.html
    tags: ir nlp segmentation tokenization passage

Kea - keyphrase extraction
    KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.
    http://www.nzdl.org/Kea/
    tags: ir nlp search kea

Extremely Fast Text Feature Extraction for Classification an...
    Keyword(s): text mining, text indexing, bag-of-words, feature engineering, feature extraction, document categorization, text tokenization Abstract: Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
    http://www.hpl.hp.com/techreports/2008/HPL-2008-91R1.html?mt...
    tags: ir parsing text tokenization hp
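    A minimal sketch of the single-pass idea in the abstract above: lowercasing, word boundary detection, and hashing are folded into one loop, so no intermediate word strings are built. The function name, the multiply-by-31 hash, and the bucket count are assumptions for illustration and are not the paper's actual hash function.

```python
def hashed_word_features(text, num_buckets=2**20):
    """Return the set of integer hash features for the words in `text`.

    Illustrative only: the hash and the bucket count are assumptions.
    """
    features = set()
    h = 0
    in_word = False
    for ch in text:
        if ch.isalnum():
            # Fold lowercasing into the hash update; lower() can yield
            # more than one character for some Unicode code points.
            for c in ch.lower():
                h = (h * 31 + ord(c)) & 0xFFFFFFFF
            in_word = True
        elif in_word:
            # A non-alphanumeric character ends the current word.
            features.add(h % num_buckets)
            h = 0
            in_word = False
    if in_word:
        # Emit the final word if the text does not end on a boundary.
        features.add(h % num_buckets)
    return features


print(sorted(hashed_word_features("Hello, hello WORLD")))
```

    Because case is folded during hashing, "Hello" and "hello" map to the same integer feature.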

Flesch-Kincaid Readability Test - Wikipedia, the free encycl...
    The Flesch/Flesch–Kincaid Readability Tests are readability tests designed to indicate comprehension difficulty when reading a passage of contemporary academic English. There are two tests, the Flesch Reading Ease, and the Flesch–Kincaid Grade Level. Although they use the same core measures (word length and sentence length), they have different weighting factors, so the results of the two tests correlate imperfectly: a text with a higher score on the Reading Ease test over another text may have a lower score on the Grade Level test. Both systems were devised by Rudolf Flesch.
    http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
    tags: ir readability
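    A small sketch of the two tests described above, using the commonly published weighting factors. It takes precomputed counts as input, since syllable counting is usually heuristic; the function and parameter names are illustrative assumptions.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
print(flesch_reading_ease(100, 5, 150))
print(flesch_kincaid_grade(100, 5, 150))
```

    Both functions use the same two ratios (words per sentence and syllables per word) and differ only in their weights.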

Gunning fog index - Wikipedia, the free encyclopedia
    In linguistics, the Gunning fog index is a test designed to measure the readability of a sample of English writing. The resulting number is an indication of the number of years of formal education that a person requires in order to easily understand the text on the first reading. That is, if a passage has a fog index of 12, it has the reading level of a U.S. high school senior. The test was developed by Robert Gunning, an American businessman, in 1952.[1] The fog index is generally used by people who want their writing to be read easily by a large segment of the population. Texts that are designed for a wide audience generally require a fog index of less than 12.
    http://en.wikipedia.org/wiki/Gunning_fog_index
    tags: readability ir
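    A small sketch of the index described above, using the commonly published formula. "Complex words" are conventionally words of three or more syllables; the counts are passed in rather than computed, and the names are illustrative assumptions.

```python
def gunning_fog(words, sentences, complex_words):
    """Gunning fog index: estimated years of formal education needed."""
    average_sentence_length = words / sentences
    percent_complex = 100 * (complex_words / words)
    return 0.4 * (average_sentence_length + percent_complex)


# Hypothetical passage: 100 words, 5 sentences, 10 complex words.
print(gunning_fog(100, 5, 10))
```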

SEOmoz | Google Search Engine Ranking Factors
    http://www.seomoz.org/article/search-ranking-factors
    tags: google pagerank ir

A Measure of Deviations from Poisson
    Low frequency words tend to be rich in content, and vice versa. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like somewhat and boycott. Both somewhat and boycott appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but boycott is a better keyword because its IDF is farther from what would be expected by chance (Poisson).
    http://www.aclweb.org/anthology-new/W/W95/W95-0110.pdf
    tags: ir tfidf search filetype_pdf media_document
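    A rough sketch of the comparison described above: the observed IDF of a word is set against the IDF expected if its occurrences were scattered across documents by a Poisson process. The corpus size and document frequencies below are hypothetical and are not figures from the paper; only the count of about 1000 occurrences comes from the abstract.

```python
import math


def observed_idf(doc_freq, num_docs):
    """IDF from the number of documents that contain the word."""
    return -math.log2(doc_freq / num_docs)


def poisson_idf(corpus_freq, num_docs):
    """IDF expected under a Poisson model with rate corpus_freq / num_docs."""
    theta = corpus_freq / num_docs
    # Probability that a document contains the word at least once.
    return -math.log2(1.0 - math.exp(-theta))


def residual_idf(corpus_freq, doc_freq, num_docs):
    """Observed IDF minus Poisson-expected IDF; larger suggests a better keyword."""
    return observed_idf(doc_freq, num_docs) - poisson_idf(corpus_freq, num_docs)


NUM_DOCS = 100_000  # hypothetical corpus size
# Two words with the same total count (1000) but hypothetical document
# frequencies: one spread thinly, one concentrated in fewer documents.
print(residual_idf(1000, 990, NUM_DOCS))  # spread out, close to Poisson
print(residual_idf(1000, 500, NUM_DOCS))  # bursty, far from Poisson
```

    A word that clusters in a few documents has a lower document frequency than chance predicts, so its observed IDF exceeds the Poisson estimate.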

MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT
    This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution inform
    http://people.ischool.berkeley.edu/~hearst/papers/tiling-acl...
    tags: nlp ir

Classifiers Without Borders: Incorporating Fielded Text From...
    Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, in this work we utilize a weighted
    http://www.cse.lehigh.edu/~brian/pubs/2008/classifiers-witho...
    tags: ir smearing

What is Krovetz Stemming?
    The Krovetz Stemmer was developed by Bob Krovetz, at the University of Massachusetts, in 1993. It is quite a 'light' stemmer, as it makes use of inflectional linguistic morphology.
    http://www.comp.lancs.ac.uk/computing/research/stemming/gene...
    tags: ir

MG4J: Managing Gigabytes for Java™
    http://mg4j.dsi.unimi.it/
    tags: ir lucene

 

