Tropo / Dave / Bookmarks : tokenization

Treebank tokenization
    Our tokenization is fairly simple: most punctuation is split from adjoining words; double quotes (") are changed to doubled single forward- and backward-quotes (`` and ''); verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.
    Examples: children's --> children 's; parents' --> parents '; won't --> wo n't; gonna --> gon na; I'm --> I 'm
    http://www.cis.upenn.edu/~treebank/tokenization.html
    tags: nlp tokenization
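    A minimal sketch of the rules quoted above, using only regular expressions. The function name treebank_tokenize and the exact rule set are assumptions made for illustration; this is not the Treebank project's own script, and it covers only the cases named in the description.

```python
import re

def treebank_tokenize(text):
    """Rough Treebank-style tokenizer (illustrative, not the official script)."""
    # A double quote at the start or after whitespace/bracket opens a quotation: ``
    text = re.sub(r'(^|[\s(\[{])"', r'\1`` ', text)
    # Any remaining double quote is treated as a closing quote: ''
    text = text.replace('"', " '' ")
    # Split most punctuation from adjoining words.
    text = re.sub(r'([,;:?!()\[\]{}])', r' \1 ', text)
    # Split a final period (optionally followed by closing quotes) only,
    # so that periods inside abbreviations are left alone.
    text = re.sub(r"\.(?=(\s|'')*$)", " . ", text)
    # Verb contractions: won't -> wo n't, I'm -> I 'm, they've -> they 've
    text = re.sub(r"(\w)n't\b", r"\1 n't", text, flags=re.I)
    text = re.sub(r"(\w)'(s|m|d|re|ve|ll)\b", r"\1 '\2", text, flags=re.I)
    # Plural genitive: parents' -> parents '
    text = re.sub(r"(\ws)'(?=\s|$)", r"\1 '", text)
    # Irregular form named in the description: gonna -> gon na
    text = re.sub(r"\b(gon)(na)\b", r"\1 \2", text, flags=re.I)
    return text.split()
```

    For example, treebank_tokenize("won't") returns ['wo', "n't"], and treebank_tokenize("parents'") returns ['parents', "'"].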

About TextTiling
    http://people.ischool.berkeley.edu/~hearst/tiling-about.html
    tags: ir nlp segmentation tokenization passage

Extremely Fast Text Feature Extraction for Classification an...
    Keyword(s): text mining, text indexing, bag-of-words, feature engineering, feature extraction, document categorization, text tokenization
    Abstract: Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
    http://www.hpl.hp.com/techreports/2008/HPL-2008-91R1.html?mt...
    tags: ir parsing text tokenization hp
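    A rough sketch of the idea in the abstract: lowercasing, word boundary detection, and hashing done in a single pass, so no intermediate word strings are built. The function name hashed_word_features, the num_buckets parameter, and the FNV-1a-style 32-bit hash are assumptions made for illustration; the abstract does not say which hash the paper uses, and pure Python will not show the speed benefit.

```python
FNV_OFFSET = 2166136261  # 32-bit FNV-1a offset basis
FNV_PRIME = 16777619     # 32-bit FNV prime

def hashed_word_features(text, num_buckets=2**20):
    """Return the set of integer hash features for the words in text."""
    features = set()
    h = FNV_OFFSET
    in_word = False
    for ch in text:
        if ch.isalnum():
            # Fold lowercasing and UTF-8 conversion into the hash update.
            for byte in ch.lower().encode("utf-8"):
                h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF
            in_word = True
        elif in_word:
            # A non-word character ends the current word: emit its hash.
            features.add(h % num_buckets)
            h = FNV_OFFSET
            in_word = False
    if in_word:
        # Emit the last word if the text ends without a delimiter.
        features.add(h % num_buckets)
    return features
```

    Because case is folded inside the loop, hashed_word_features("Hello, world") and hashed_word_features("hello WORLD!") produce the same feature set.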

 

