Pages tagged datamining:

Data Mining with R: learning by case studies
http://www.liaad.up.pt/~ltorgo/DataMiningWithR/

R is a really excellent tool ... I use it to analyse performance data from tuning sessions.
A Guide to Recommender Systems - ReadWriteWeb
http://www.readwriteweb.com/archives/recommender_systems.php
Various methods for doing user recommendations.
Helps you fully understand recommendation engines. Are there any brand applications?
ReadWriteWeb kicks off a series of posts on recommender systems: how they work and the different types of systems in use across various domains.
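The series surveys several approaches; as one concrete illustration, here is a minimal sketch of user-based collaborative filtering with cosine similarity in Python. The ratings, user names, and item names are invented for the example, and this is not the article's own code.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data, purely for illustration.
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 3, "item2": 1, "item3": 2, "item4": 3},
    "carol": {"item1": 4, "item2": 3, "item4": 5},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, k=2):
    """Score unseen items by similarity-weighted ratings of the k nearest users."""
    sims = sorted(
        ((cosine(ratings[user], ratings[o]), o) for o in ratings if o != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other in sims:
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("alice"))   # e.g. [('item4', ...)]
```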
Mining The Thought Stream
http://www.techcrunch.com/2009/02/15/mining-the-thought-stream/
How Google and Facebook are using R : Data Evolution
http://dataspora.com/blog/predictive-analytics-using-r/
This looks like a fun language to play with.
google tools
Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled “The R and Science of Predictive Analytics”, co-located with the Predictive Analytics World conference here in SF. The panel comprised four recognized R users from industry:
* Bo Cowgill, Google
* Itamar Rosenn, Facebook
* David Smith, Revolution Computing
* Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)
looks like a promising read
DeepPeep: discover the hidden web
http://www.deeppeep.org/index.jsp
DeepPeep is a search engine specialized in Web forms. The current beta version tracks 13,000 forms across 7 domains. DeepPeep helps you discover the entry points to content in Deep Web (aka Hidden Web) sites, including online databases and Web services.
A search engine for the invisible (deep) web
Twitter Technology Blog: We Got Data
http://dev.twitter.com/2008/10/we-got-data.html
JDMP » Java Data Mining Package » About
http://www.jdmp.org/
The Java Data Mining Package (JDMP) is an open source Java library for data analysis and machine learning. It facilitates access to data sources and machine learning algorithms (e.g. clustering, regression, classification, graphical models, optimization) and provides visualization modules. It includes a matrix library for storing and processing any kind of data, with the ability to handle very large matrices even when they do not fit into memory. Import and export interfaces are provided for JDBC databases, TXT, CSV, Excel, Matlab, LaTeX, MTX, HTML, WAV, BMP and other file formats. JDMP provides a number of algorithms and tools, but also interfaces to other machine learning and data mining packages (Weka, LibSVM, Mallet, Lucene, Octave).
Data mining and visualisation tool that connects to a number of data sources (including Matlab and Weka)
datamining
LGPL 3
Amazon Web Services Blog: New AWS Public Data Sets - Economics, DBpedia, Freebase, and Wikipedia
http://aws.typepad.com/aws/2009/02/new-aws-public-data-sets-economics-dbpedia-freebase-and-wikipedia.html
We have just released four additional AWS public data sets, and have updated another one. In the Economics category, we have added a set of transportation databases from the US Bureau of Transportation Statistics. Data and statistics are provided for aviation, maritime, highway, transit, rail, pipeline, bike & pedestrian, and other modes of transportation, all in CSV format. I was able to locate employment data for our hometown airline and found out that they employed 9,322 full-time and 1,122 part-time employees as of the end of 2007. In the Encyclopedic category, we have added access to the DBpedia Knowledge Base, the Freebase Data Dump, and the Wikipedia Extraction, or WEX.
amazon
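A rough sketch of the kind of lookup described above, reading one of the CSV files and filtering for a single carrier and year. The file name and column names are hypothetical placeholders, not the actual BTS schema.

```python
import csv

# Hypothetical file and column names, purely for illustration; check the
# actual headers of the BTS CSV you download.
with open("bts_airline_employment.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["carrier_name"] == "Example Air" and row["year"] == "2007":
            print(row["full_time_employees"], row["part_time_employees"])
```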
twendz : Exploring Twitter Conversations and Sentiment
http://twendz.waggeneredstrom.com/
10 papers you need to read | Science for SEO
http://www.scienceforseo.com/information-retrieval/10-papers-you-need-to-read/
This is a list of my top 10 freely available papers on the topic of information retrieval. You will notice that they are rather old, but the techniques described and the findings are not always dated. Those that are dated are important nonetheless, because they provide a good foundation for understanding why things are as they are in information retrieval these days.
De-anonymizing Social Networks
http://randomwalker.info/social-networks/index.html
Pittsburgh Pattern Recognition
http://facemining.pittpatt.com/
facial recognition Star Trek
Navigate video by facial recognition, demo'd on the original Star Trek
face mining - nice
Now that's a good use of facial recognition!
Data.gov
http://www.data.gov/
The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government. Although the initial launch of Data.gov provides a limited portion of the rich variety of Federal datasets presently available, we invite you to actively participate in shaping the future of Data.gov by suggesting additional datasets and site enhancements to provide seamless access and use of your Federal data. Visit today with us, but come back often. With your help, Data.gov will continue to grow and change in the weeks, months, and years ahead.
WOW "The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government."
The new U.S. federal open data site is live! "Data.gov will open up the workings of government by making economic, healthcare, environmental, and other government information available on a single website, allowing the public to access raw data and transform it in innovative ways."
The Three Sexy Skills of Data Geeks : Dataspora Blog
http://dataspora.com/blog/sexy-data-geeks/
...histograms, where labels and colors are minimally set by default. Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team. A second kind of data visualization are those intended to communicate to a wider audience, whose goal is to visually advocate for a hypothesis. ...
The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.
...parsing, and proofing one's data before it's suitable for analysis. Real world data is messy. At best it's inconsistently delimited or packed into an unnecessarily complex XML s...
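The two clips above are about exploratory plots and data munging. A tiny illustration of that munging step, assuming a hypothetical log whose rows are inconsistently delimited by commas or tabs (the field names and numbers are invented):

```python
import re
from collections import Counter

# Hypothetical messy input: some rows comma-separated, some tab-separated.
raw_rows = [
    "2009-06-01,search,412",
    "2009-06-01\tcheckout\t37",
    "2009-06-02, search ,390",
]

counts = Counter()
for row in raw_rows:
    # Normalize the delimiter and strip stray whitespace before parsing.
    date, event, n = [field.strip() for field in re.split(r"[,\t]", row)]
    counts[event] += int(n)

# A quick text "histogram" for an audience of one (the analyst).
for event, total in counts.most_common():
    print(f"{event:10s} {'#' * (total // 50)} {total}")
```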
Data Evolution
http://dataspora.com/blog/
...little to do with their lascivious leanings (ahem, BedPost), and more with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it).
Video: Designing for Big Data, by Jeffrey Veen
http://www.veen.com/jeff/archives/001000.html
This is a 20-minute talk I gave at the Web2.0 Expo in San Francisco a couple weeks ago. In it, I describe two trends: how we're shifting as a culture from consumers to participants, and how technology has enabled massive amounts of data to be recorded, stored, and analyzed. Putting those things together has resulted in some fascinating innovations that echo data visualization work that's been happening for centuries.
highlighting the shifts in design techniques for streams rather than a drop of data.
Why 1974 was the seminal year for Web 2.0
Rise of the Data Scientist | FlowingData
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
Interesting!
Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2 | Cloudera
http://www.cloudera.com/hadoop-data-intensive-application-tutorial
This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps:
* Configure and launch a Hadoop cluster on Amazon EC2 using the Cloudera tools
* Load Wikipedia log data into Hadoop from Amazon Elastic Block Store (EBS) snapshots and Amazon S3
* Run simple Pig and Hive commands on the log data
* Write a MapReduce job to clean the raw data and aggregate it to a daily level (page_title, date, count)
* Write a Hive query that finds trending Wikipedia articles by calling a custom mapper script
* Join the trend data in Hive with a table of Wikipedia page IDs
* Export the trend query results to S3 as a tab delimited text file for use in our web application's MySQL database
This tutorial will show how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application.
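A minimal local simulation of the aggregation step named in the list above (reduce raw log lines to daily (page_title, date, count) records). This is not the tutorial's code, and the assumed input format (tab-separated page title, timestamp, and view count per line) is a guess for illustration only.

```python
# Local simulation of the aggregation step: map raw log lines to
# (page_title, date) keys, then reduce by summing the counts.
from collections import defaultdict

# Assumed raw format: page_title <TAB> ISO timestamp <TAB> view_count
sample_log = [
    "Main_Page\t2009-06-15T13:00:00\t1200",
    "Main_Page\t2009-06-15T14:00:00\t1450",
    "Hadoop\t2009-06-15T13:00:00\t90",
]

def map_line(line):
    page, ts, count = line.rstrip("\n").split("\t")
    yield (page, ts[:10]), int(count)          # key: (page_title, date)

def reduce_counts(pairs):
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals

pairs = (kv for line in sample_log for kv in map_line(line))
for (page, date), total in sorted(reduce_counts(pairs).items()):
    print(f"{page}\t{date}\t{total}")
# Prints: Hadoop 2009-06-15 90, then Main_Page 2009-06-15 2650
```

In the real tutorial this logic runs as a Hadoop job over the full Wikipedia logs; the local version only shows the shape of the map and reduce steps.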
FlightCaster
http://flightcaster.com/
flight delay prediction, way better than what the airlines tell you
"Anonymized" data really isn't—and here's why not - Ars Technica
http://arstechnica.com/tech-policy/news/2009/09/your-secrets-live-online-in-databases-of-ruin.ars
birthdate
Machine learning classifier gallery
http://home.comcast.net/~tom.fawcett/public_html/ML-gallery/pages/index.html
Interesting comparative performance of various algorithms on different data
A highly informative visualization of the biases of different ML classifiers. Really useful, especially for talks to non-experts.
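In the same spirit as the gallery, a quick way to see how differently biased classifiers behave on the same 2-D data is to train a few side by side. A minimal scikit-learn sketch (not the gallery's own code; the dataset and parameters are arbitrary choices):

```python
# Compare how differently-biased classifiers do on the same 2-D dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(5),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "RBF SVM": SVC(gamma=2, C=1),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name:15s} accuracy: {clf.score(X_te, y_te):.2f}")
```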
Evolution of a Revolution: Visualizing Millions of Iran Tweets
http://www.readwriteweb.com/archives/evolution_revolution_visualizing_millions_iran_tweets.php
Visualizing Millions of Iran Tweets - computational history of news using twitter
At its peak, a search for "Iran" on Twitter generated over 100,000 tweets per day and over 8,000 tweets per hour. The plot just below shows the growth in volume of information in the number of tweets per hour. How does an Internet junkie, news organization, or political operative monitor rapidly evolving real-time events, from the crucial details to the bigger picture? More importantly, how can a data stream be turned into real-time action, reaching the people who need it, when they need it, and in a form they can easily digest?
Article describes an effort aimed at more sophisticated analysis of Twitter trends. The author is a co-founder of Infoharmoni, a startup building knowledge interfaces for real-time data sets.
How to algorithmically discover and deploy novel social structures is perhaps the billion, or trillion, dollar question. With Twitter, the data and API are in place. And if the history of computation is any guide, once programming a system becomes possible, progressing from a hack to an application to a platform is only a matter of time.
'...how can a data stream be turned into real-time action, reaching the people who need it, when they need it, and in a form they can easily digest? At the most abstract level, history and computation are the same thing: the evolution of systems over time. Twitter has several remarkable properties that allow us to finally leverage this correspondence in tangible ways. The simplicity of its data, the openness of its system, and its extreme time resolution make it possible for us to detect atoms of history, those moments when something is triggered and society is reconfigured ever so slightly. Simply tracking the volume of various phrases gives us a sense of what is happening on the street, literally and figuratively. But that signal is but a shadow of a far more complex and intricate reality, an interwoven web of individuals and actions. -- Disruptive events lead to information elites.'
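The "tweets per hour" curve described above is essentially a grouped count over timestamps. A minimal sketch of that counting step (the tweets and the matching rule are invented, not the authors' pipeline):

```python
# Count matching tweets per hour -- the raw signal behind the volume plot.
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, text) pairs standing in for a Twitter search feed.
tweets = [
    ("2009-06-15 13:05:12", "Protests continue in Iran #iranelection"),
    ("2009-06-15 13:47:03", "RT: reports from Tehran, Iran ..."),
    ("2009-06-15 14:02:55", "Iran election coverage"),
]

per_hour = Counter()
for ts, text in tweets:
    if "iran" in text.lower():
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")
        per_hour[hour] += 1

for hour, n in sorted(per_hour.items()):
    print(hour, n)
```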
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Hastie, Tibshirani and Friedman (2008). Springer-Verlag. Full-text PDF is free.
free online book
@dataspora: "The Elements of Statistical Learning, the authoritative text on the subject, now free at authors' site http://bit.ly/2J8WNK (ht @johndcook)" (from http://twitter.com/dataspora/status/4847621837)
Guide to Getting Started in Machine Learning | A Beautiful WWW
http://abeautifulwww.com/2009/10/11/guide-to-getting-started-in-machine-learning/
How Team of Geeks Cracked Spy Trade - WSJ.com
http://online.wsj.com/article/SB125200842406984303.html
Palantir Technologies has designed what many intelligence analysts say is the most effective tool to date to investigate terrorist networks. The software's main advance is a user-friendly search tool that can scan multiple data sources at once, something previous search tools couldn't do.
Palantir Technologies has designed what many intelligence analysts say is the most effective tool to date to investigate terrorist networks.
Michael Nielsen » The Google Technology Stack
http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/
Interesting set of links and posts describing the technologies Google builds its software on, and how they work together.
The Google Technology Stack … or as I would put it: An Introduction to MapReduce, Data Mining and PageRank
A great in-depth treatment of the engine that powers Google
Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.
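Since the series is pitched as an introduction to MapReduce, data mining, and PageRank, here is a compact reminder of what the PageRank computation itself does, as a plain power iteration over a toy link graph (no MapReduce plumbing, and not code from the lectures):

```python
# PageRank by power iteration on a tiny toy link graph.
damping = 0.85
links = {                       # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

for page, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(r, 4))
```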
pskomoroch's dataset Bookmarks on Delicious
http://delicious.com/pskomoroch/dataset
Resource list of public datasets
Zero Intelligence Agents » Must-Have R Packages for Social Scientists
http://www.drewconway.com/zia/?p=1614
will send it to Chopy
"If you conduct social science research but are desperately clinging onto your SAS, SPSS or Matlab licenses; waiting for someone to convince you of R’s value, please allow me to be the first to try".
Data Sets | GroupLens Research
http://www.grouplens.org/taxonomy/term/14
Collaborative Filtering with Ensembles - igvita.com
http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/
A new technique for recommendation: apply specific techniques and combine the results
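A minimal sketch of that blending idea: take predicted ratings from two hypothetical recommenders and pick the mixing weight that does best on a small held-out set. All numbers are made up, and the post's own method may differ.

```python
# Blend two recommenders' predicted ratings with a weighted average,
# choosing the weight that minimizes squared error on a validation set.
actual  = [4.0, 3.0, 5.0, 2.0, 4.5]          # held-out true ratings (made up)
model_a = [3.8, 3.4, 4.6, 2.5, 4.0]          # predictions from recommender A
model_b = [4.3, 2.6, 4.9, 1.8, 4.8]          # predictions from recommender B

def sse(weight):
    blended = [weight * a + (1 - weight) * b for a, b in zip(model_a, model_b)]
    return sum((p - t) ** 2 for p, t in zip(blended, actual))

best_w = min((w / 100 for w in range(101)), key=sse)
print(f"best weight for A: {best_w:.2f}, SSE: {sse(best_w):.3f}")
```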
Measuring Measures: Learning About Statistical Learning
http://measuringmeasures.blogspot.com/2010/01/learning-about-statistical-learning.html
IEEE Spectrum: The Million Dollar Programming Prize
http://www.spectrum.ieee.org/may09/8788
...year-old Netflix Prize competition offers a grand prize of US $1 million for an algorithm that’s 10 percent more accurate than the one Netflix uses to predict customers’ movie preferences.
Netflix's bounty for improving its movie-recommendation software is almost in the bag. Here is one team's account
Bell Labs explains their strategy for solving Netflix's collaborative filtering problem.
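The Prize's "10 percent more accurate" is measured as root mean squared error over predicted ratings; a quick illustration of the metric with made-up numbers:

```python
# Root mean squared error, the Netflix Prize's accuracy metric.
from math import sqrt

actual    = [4, 3, 5, 2, 4]            # true ratings (made up)
predicted = [3.7, 3.2, 4.5, 2.4, 4.1]  # a model's predictions (made up)

rmse = sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
print(f"RMSE = {rmse:.3f}")
# A qualifying entry needed an RMSE at least 10% lower than the baseline's,
# i.e. rmse <= 0.9 * baseline_rmse.
```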
PeteSearch: How to split up the US
http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
Data visualization of Facebook profiles: "Looking at the network of US cities, it's been remarkable to see how groups of them form clusters, with strong connections locally but few contacts outside the cluster. For example Columbus, OH and Charleston WV are nearby as the crow flies, but share few connections, with Columbus clearly part of the North, and Charleston tied to the South. "Some of these clusters are intuitive, like the old south, but there's some surprises too, like Missouri, Louisiana and Arkansas having closer ties to Texas than Georgia. To make sense of the patterns I'm seeing, I've marked and labeled the clusters, and added some notes about the properties they have in common..."
Fun stuff, lots of entertaining demographic data.
According to Facebook
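The clusters described come from community detection on a friendship graph between cities. A rough sketch of that kind of analysis on a toy graph, using networkx's greedy modularity communities (not the author's data or method; the city ties are invented):

```python
# Find tightly-connected groups of cities in a toy friendship graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("Columbus", "Cleveland"), ("Cleveland", "Pittsburgh"),
    ("Columbus", "Pittsburgh"),
    ("Charleston", "Atlanta"), ("Atlanta", "Nashville"),
    ("Charleston", "Nashville"),
    ("Pittsburgh", "Charleston"),        # one weak cross-cluster tie
])

for community in greedy_modularity_communities(G):
    print(sorted(community))
```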
The Man Who Looked Into Facebook's Soul
http://www.readwriteweb.com/archives/facebook_user_data_analysis.php
Personal Data Mining | Creativity Online
http://creativity-online.com/?action=news:article&newsId=136077
Nice piece on the growing trend of data design or data visualization.
Science gleans 60TB of behavior data from Everquest 2 logs - Ars Technica
http://arstechnica.com/science/news/2009/02/aaas-60tb-of-behavioral-data-the-everquest-2-server-logs.ars
WEEK 8 -- 03/10/2010
In February 2009, Dmitri Williams ...
4 years, 400k players, ~60 TB -- roughly 475 KB/s overall, a bit over 1 byte per player per second (quick sanity check in the sketch after this entry).
Thanks to a partnership with Sony, a team of academic researchers have obtained the largest set of data on social interactions they've ever gotten their hands on: the complete server logs of Everquest 2, which track every action performed in the game.
...from psychologists to epidemiologists have wondered for some time whether online, multiplayer games might provide some ways to test concepts that are otherwise difficult to track in the real world. A Saturday morning session at the meeting of the American Association for the Advancement of Science...
Food for data miners
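A quick check of the back-of-envelope rate in the comment above:

```python
# Rough rate implied by ~60 TB of logs over 4 years from ~400k players.
terabytes = 60
seconds = 4 * 365 * 24 * 3600
players = 400_000

bytes_per_sec = terabytes * 1e12 / seconds
print(f"{bytes_per_sec / 1e3:.0f} KB/s overall")           # ~476 KB/s
print(f"{bytes_per_sec / players:.2f} bytes/player/sec")    # ~1.19
```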
Gary Flake: is Pivot a turning point for web exploration? | Video on TED.com
http://www.ted.com/talks/gary_flake_is_pivot_a_turning_point_for_web_exploration.html
Gary Flake demos Pivot, a new way to browse and arrange massive amounts of images and data online. Built on breakthrough Seadragon technology, it enables spectacular zooms in and out of web databases, and the discovery of patterns and links invisible in standard web browsing.
Pivot
Your Facebook Profile Makes Marketers’ Dreams Come True | Epicenter
http://www.wired.com/epicenter/2009/04/your-facebook-profile-makes-marketers-dreams-come-true/
Your Facebook Profile Makes Marketers’ Dreams Come True
by Eliot Van Buskirk // Generally I stay away from Wired, which often has published technological fantasies and hype that have been downright silly -- and very misleading
Your Facebook Profile Makes Marketers’ Dreams Come True http://bit.ly/4rs9Z [from http://twitter.com/AdNerds/statuses/1659055245]
An Exercise in Species Barcoding
http://norvig.com/ibol.html
Recently I've been looking at the International Barcode of Life project. The idea is take DNA samples from animals and plants to help identify known species and discover new ones. While other projects strive to identify the complete genome for a few species, such as humans, dogs, red flour beetles and others, the barcoding project looks at a short 650-base sequence from a single gene. The idea is that this short sequence may not tell the whole story of an organism, but it should be enough to identify and distinguish between species. It will be successful as a barcode if (a) all (or most) members of a species have the same (or very similar) sequences and (b) members of different species have very different sequences.
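Criteria (a) and (b) can be checked directly by comparing within-species and between-species similarity. A toy sketch using simple per-position identity over already-aligned, invented sequences (real barcoding uses proper alignment and distance models):

```python
# Compare within-species vs. between-species similarity of short sequences.
# Toy, already-aligned sequences; the species and bases are invented.
samples = {
    ("species_A", 1): "ACGTACGTAC",
    ("species_A", 2): "ACGTACGTAT",
    ("species_B", 1): "TGCATGCAGG",
    ("species_B", 2): "TGCATGCAGC",
}

def identity(s, t):
    """Fraction of positions that match in two equal-length sequences."""
    return sum(a == b for a, b in zip(s, t)) / len(s)

keys = list(samples)
for i, k1 in enumerate(keys):
    for k2 in keys[i + 1:]:
        kind = "within " if k1[0] == k2[0] else "between"
        print(f"{kind} {k1} vs {k2}: {identity(samples[k1], samples[k2]):.2f}")
```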
The ideal data mining textbook for programmers: "Programming Collective Intelligence" (『集合知プログラミング』) - 図書館情報学を学ぶ
http://d.hatena.ne.jp/kunimiya/20081116/p1
I have no knowledge of statistics at all, so I'll be studying it from now on.
- 図書館情報学を学ぶ (Studying Library and Information Science)
Data-Intensive Text Processing with MapReduce
http://www.umiacs.umd.edu/~jimmylin/book.html
Let's add Hatena keywords to the MeCab dictionary - 不可視点
http://d.hatena.ne.jp/code46/20090531/p1
- Let's add Hatena keywords to the MeCab dictionary - 不可視点 http://j.mp/9SnTxA
http://d.hatena.ne.jp/rin1024/20090830/1251608698
How to specify the naist-jdic dictionary file
What is data science? - O'Reilly Radar
http://radar.oreilly.com/2010/06/what-is-data-science.html
The future belongs to the companies who figure out how to collect and use data successfully. In this in-depth piece, O'Reilly editor Mike Loukides examines the unique skills and opportunities that flow from data science.
Covers aspects of business intelligence, text mining, and other statistical analysis
COS 493, Spring 2002: Schedule and Readings
http://www.cs.princeton.edu/courses/archive/spring02/cs493/schedule.html
Algorithms for Massive Data Sets
MetaOptimize Q+A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
http://metaoptimize.com/qa/