Home - GitHub
Project Goose is an article extractor written in Java, using Maven for dependency management. It's an open source project born from Gravity Labs (http://gravity.com). Its goal is to take a webpage, perform calculations, and extract the main text of the article, as well as recommend which image on the page is likely the most relevant.
https://github.com/jiminoc/goose/wiki
tags: nlp ir
List of resources: Article text extraction from HTML documen...
http://tomazkovacic.com/blog/56/list-of-resources-article-te...
tags: ir
Mining of Massive Datasets
http://www.scribd.com/doc/46052657/Untitled?secret_password=...
tags: ir ml
Recommendations research research papers collection | Mendel...
collaborative filtering
http://www.mendeley.com/research-papers/collections/796791/R...
tags: ir cf
N-gram data from Project Gutenberg | Prashanth Ellina
http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-...
tags: gutenberg ir ngram
Bollysite Blog » Blog Archive » Python coded GoogleMini SA...
http://blog.bollysite.com/2010/02/08/python-coded-googlemini...
tags: gae ir search sayt
Compendium of Lost Words
Welcome to the Compendium of Lost Words, a component of The Phrontistery. The Compendium lists over 400 of the rarest modern English words - in fact, ones that have been entirely absent from the Internet, including all online dictionaries, until now. By revealing the existence of these words online, I do not necessarily promote their revival, but I do encourage an appreciation of the flexibility of English vocabulary. In theory, the Compendium will be the only web page on which each of these words occurs in its proper English context. Click on a link below to take you to the four main Compendium pages, organized alphabetically by word, or on the links for more information about the site.
http://phrontistery.info/clw.html
tags: ir
Open Text Summarizer
Automatic text summarization is a technique in which a computer program summarizes a document. A text is put into the computer and a highlighted (summarized) text is returned. The Open Text Summarizer is an open source tool for summarizing texts. The program reads a text and decides which sentences are important and which are not. It ships with Ubuntu, Fedora and other Linux distros. OTS supports many (25+) languages, which are configured in XML files.
http://libots.sourceforge.net/
tags: ir summarization
IR Datasets
http://boston.lti.cs.cmu.edu/callan/Data/#Web
tags: data ir
Introduction to Information Retrieval
This is the companion website for the following book. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
http://www-csli.stanford.edu/~hinrich/information-retrieval-...
tags: ir search clustering informationretrieval information-retrieval
LingPipe: Competition
On this page, we break our competition down into academic toolkits and industrial toolkits. We only consider software that is available for linguistic processing, not companies that rely on linguistic processing in an application but do not sell that technology. How does LingPipe compare to the below offerings?
http://alias-i.com/lingpipe/web/competition.html
tags: ir nlp
Statistical NLP / corpus-based computational linguistics res...
http://www-nlp.stanford.edu/links/statnlp.html
tags: ir nlp
SCIgen - An Automatic CS Paper Generator
SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence.
http://pdos.csail.mit.edu/scigen/
tags: markov ir language textgeneration
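Note: generating text from a hand-written context-free grammar, as the SCIgen description mentions, can be sketched as recursive random expansion. The grammar, symbol names, and functions below are invented for illustration and are not taken from SCIgen.

```python
import random

# Toy grammar, made up for this sketch; it is NOT SCIgen's grammar.
# Keys are non-terminals; each maps to a list of alternative productions.
GRAMMAR = {
    "SENTENCE": [["We", "VERB", "NOUN_PHRASE", "."]],
    "VERB": [["present"], ["evaluate"], ["refute"]],
    "NOUN_PHRASE": [["ADJ", "NOUN"], ["NOUN"]],
    "ADJ": [["scalable"], ["probabilistic"]],
    "NOUN": [["compilers"], ["hash tables"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal strings."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])  # pick one alternative at random
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words

def generate(seed=None):
    """Generate one random sentence; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return " ".join(expand("SENTENCE", rng)).replace(" .", ".")

if __name__ == "__main__":
    print(generate())
```
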
The Stanford NLP (Natural Language Processing) Group
Named Entity Recognition (NER) and Information Extraction (IE)
http://nlp.stanford.edu/ner/index.shtml
tags: nlp ir ie namedentity
About TextTiling
http://people.ischool.berkeley.edu/~hearst/tiling-about.html
tags: ir nlp segmentation tokenization passage
Kea - keyphrase extraction
KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.
http://www.nzdl.org/Kea/
tags: ir nlp search kea
Extremely Fast Text Feature Extraction for Classification an...
Keyword(s): text mining, text indexing, bag-of-words, feature engineering, feature extraction, document categorization, text tokenization Abstract: Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
http://www.hpl.hp.com/techreports/2008/HPL-2008-91R1.html?mt...
tags: ir parsing text tokenization hp
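Note: the abstract above describes folding lowercasing, word boundary detection, and hash computation into one pass. A minimal sketch of that idea follows; the multiplier 31, the 32-bit mask, the bucket count, and the function name are assumptions for illustration, not the hash function or code from the paper.

```python
def hashed_word_features(text, num_buckets=2**20):
    """Return {bucket: count} from a single pass over the characters.

    Lowercasing, word boundary detection, and hashing happen together,
    so no intermediate word strings are built.
    """
    features = {}
    h = 0
    in_word = False
    for ch in text:
        if ch.isalnum():
            # Lowercase while updating the rolling hash; lower() can yield
            # more than one character for some Unicode letters.
            for low in ch.lower():
                h = (h * 31 + ord(low)) & 0xFFFFFFFF
            in_word = True
        elif in_word:
            # A non-alphanumeric character ends the current word.
            bucket = h % num_buckets
            features[bucket] = features.get(bucket, 0) + 1
            h = 0
            in_word = False
    if in_word:  # flush the last word if the text ends mid-word
        bucket = h % num_buckets
        features[bucket] = features.get(bucket, 0) + 1
    return features
```

Distinct words can collide in the same bucket; the integer keys stand in for the word strings.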
Flesch-Kincaid Readability Test - Wikipedia, the free encycl...
The Flesch/Flesch–Kincaid Readability Tests are readability tests designed to indicate comprehension difficulty when reading a passage of contemporary academic English. There are two tests, the Flesch Reading Ease, and the Flesch–Kincaid Grade Level. Although they use the same core measures (word length and sentence length), they have different weighting factors, so the results of the two tests correlate imperfectly: a text with a higher score on the Reading Ease test over another text may have a lower score on the Grade Level test. Both systems were devised by Rudolf Flesch.
http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
tags: ir readability
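Note: the two tests share the same inputs but weight them differently. A minimal sketch using the commonly published coefficients; it takes pre-computed counts as arguments, since syllable counting needs a heuristic or dictionary that is out of scope here. The function names are illustrative.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

if __name__ == "__main__":
    # Hypothetical counts for a short passage.
    print(flesch_reading_ease(100, 5, 150))
    print(flesch_kincaid_grade(100, 5, 150))
```
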
Gunning fog index - Wikipedia, the free encyclopedia
In linguistics, the Gunning fog index is a test designed to measure the readability of a sample of English writing. The resulting number is an indication of the number of years of formal education that a person requires in order to easily understand the text on the first reading. That is, if a passage has a fog index of 12, it has the reading level of a U.S. high school senior. The test was developed by Robert Gunning, an American businessman, in 1952.[1] The fog index is generally used by people who want their writing to be read easily by a large segment of the population. Texts that are designed for a wide audience generally require a fog index of less than 12.
http://en.wikipedia.org/wiki/Gunning_fog_index
tags: readability ir
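Note: the fog index is usually stated as 0.4 times the sum of the average sentence length and the percentage of complex words (conventionally, words of three or more syllables). A minimal sketch taking pre-computed counts; the function name and the sample counts are illustrative.

```python
def gunning_fog(words, sentences, complex_words):
    """Gunning fog index from raw counts.

    complex_words: number of words counted as complex, conventionally
    those with three or more syllables.
    """
    average_sentence_length = words / sentences
    percent_complex = 100.0 * complex_words / words
    return 0.4 * (average_sentence_length + percent_complex)

if __name__ == "__main__":
    # Hypothetical passage: 100 words, 5 sentences, 10 complex words.
    print(gunning_fog(100, 5, 10))
```
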
SEOmoz | Google Search Engine Ranking Factors
http://www.seomoz.org/article/search-ranking-factors
tags: google pagerank ir
A Measure of Deviations from Poisson
Low frequency words tend to be rich in content, and vice versa. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like somewhat and boycott. Both somewhat and boycott appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but boycott is a better keyword because its IDF is farther from what would be expected by chance (Poisson).
http://www.aclweb.org/anthology-new/W/W95/W95-0110.pdf
tags: ir tfidf search filetype_pdf media_document
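Note: the comparison described above can be sketched by computing a word's observed IDF and the IDF a Poisson model would predict from its total count. This assumes IDF = -log2(df / D) and a Poisson rate of cf / D per document; the function names and the counts in the example are made up, not figures from the paper.

```python
import math

def observed_idf(doc_freq, num_docs):
    """IDF from the number of documents that contain the word."""
    return -math.log2(doc_freq / num_docs)

def poisson_idf(collection_freq, num_docs):
    """IDF predicted if occurrences were spread over documents by chance.

    Under a Poisson model with rate cf / D, the chance that a document
    contains the word at least once is 1 - exp(-cf / D).
    """
    rate = collection_freq / num_docs
    return -math.log2(1.0 - math.exp(-rate))

def residual_idf(collection_freq, doc_freq, num_docs):
    """Observed minus predicted IDF; larger values suggest a better keyword."""
    return observed_idf(doc_freq, num_docs) - poisson_idf(collection_freq, num_docs)

if __name__ == "__main__":
    # Hypothetical counts: same total frequency, different document spread.
    print(residual_idf(1000, 900, 10000))  # evenly spread word
    print(residual_idf(1000, 100, 10000))  # word concentrated in few documents
```
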
MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT
This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution inform
http://people.ischool.berkeley.edu/~hearst/papers/tiling-acl...
tags: nlp ir
Classifiers Without Borders: Incorporating Fielded Text From...
Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, in this work we utilize a weighted
http://www.cse.lehigh.edu/~brian/pubs/2008/classifiers-witho...
tags: ir smearing
What is Krovetz Stemming?
The Krovetz Stemmer was developed by Bob Krovetz, at the University of Massachusetts, in 1993. It is quite a 'light' stemmer, as it makes use of inflectional linguistic morphology.
http://www.comp.lancs.ac.uk/computing/research/stemming/gene...
tags: ir
MG4J: Managing Gigabytes for Java™
http://mg4j.dsi.unimi.it/
tags: ir lucene