distributed

Pages tagged distributed:

sccache - Google Code
http://code.google.com/p/sccache/

The SHOP.COM Cache System is an object cache system that...

Plurk Open Source - LightCloud - Distributed and persistent key value database
http://opensource.plurk.com/LightCloud/

aid, here is what it takes to do 10.000 gets and sets:

Gamasutra - 1500 Archers on a 28.8: Network Programming in Age of Empires and Beyond
http://www.gamasutra.com/view/feature/3094/1500_archers_on_a_288_network_.php

Network Programming

This paper explains the design architecture, implementation, and some of the lessons learned creating the multiplayer (networking) code for the Age of Empires 1 & 2 games; and discusses the current and future networking approaches used by Ensemble Studios in its game engines.

RPyC: Unbounded Computing
http://rpyc.wikidot.com/

dynamic nature, to overcome the physical boundaries between processes and computers, so that remote objects can be manipulated as if they were local.

An RPC for python to build cloud and distributed systems

remote procedure calls in python, build a simple distributed computing platform

Advogato: GitTorrent, The Movie
http://advogato.org/article/994.html

More about the decentralisation of IT

"GitTorrent makes Git truly distributed. The initial plans are for reducing mirror loading, however the full plans include totally distributed development: no central mirrors whatsoever. PGP signing and other web-of-trust-based mechanisms will take over from protocols on ports (e.g. ssh) as the access control "clearing house". The implications of a truly distributed revision control system are truly staggering: unrestricted software freedom

That's exactly what I am looking for - yeah!

Imagine that an entire project - its web site, documentation, wiki, bug-tracker, source code and binaries are all managed and stored in a peer-to-peer distributed git repository.

From a simple, simple project that is suffering from an inexplicable near complete lack of attention from the free software community comes a revolutionary change in the way that free software is developed and distributed. [[Reminds me of Kragen’s “[What’s wrong with HTTP?](http://lists.canonical.org/pipermail/kragen-tol/2006-November/000841.html)” article. —Ed.]]

Collaborative Map-Reduce in the Browser - igvita.com
http://www.igvita.com/2009/03/03/collaborative-map-reduce-in-the-browser/

After several iterations, false starts, and great conversations with Michael Nielsen, a flash of the obvious came: HTTP + Javascript! What if you could contribute to a computational (Map-Reduce) job by simply pointing your browser to a URL? Surely your social network wouldn't mind opening a background tab to help you crunch a dataset or two!

Are Cloud Based Memory Architectures the Next Big Thing? | High Scalability
http://highscalability.com/are-cloud-based-memory-architectures-next-big-thing

We are on the edge of two potent technological changes: Clouds and Memory Based Architectures. This evolution will rip open a chasm where new players can enter and prosper. Google is the master of disk. You can't beat them at a game they perfected. Disk based databases like SimpleDB and BigTable are complicated beasts, typical last gasp products of any aging technology before a change. The next era is the age of Memory and Cloud which will allow for new players to succeed. The tipping point is soon. Let's take a short trip down web architecture lane:
# It's 1993: Yahoo runs on FreeBSD, Apache, Perl scripts and a SQL database
# It's 1995: Scale-up the database.
# It's 1998: LAMP
# It's 1999: Stateless + Load Balanced + Database + SAN
# It's 2001: In-memory data-grid.
# It's 2003: Add a caching layer.
# It's 2004: Add scale-out and partitioning.
# It's 2005: Add asynchronous job scheduling and maybe a distributed file system.
# It's 2007: Move it all into the cloud.
# It's 2008: Cloud +

What makes Memory Based Architectures different from traditional architectures is that memory is the system of record. Also discusses Jim Starkey's NimbusDB.

MIT’s Introduction to Algorithms, Lectures 20 and 21: Parallel Algorithms - good coders code, great reuse
http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-thirteen/

Lectures

This is the thirteenth post in an article series about MIT’s lecture course “Introduction to Algorithms.” In this post I will review lectures twenty and twenty-one on parallel algorithms. These lectures cover the basics of multithreaded programming and multithreaded algorithms.

Cloudera's Basic Hadoop Training | Cloudera
http://www.cloudera.com/hadoop-training-basic

Cloudera's Basic Hadoop Training is available online, free of charge. If you have questions about the content, please feel free to direct them to community support. Note: The activities and tutorials suggest downloading our virtual machine (VM). They all use the same VM, so if you download it once, there is no need to do so again.

Message Queue Evaluation Notes - Second Life Wiki
http://wiki.secondlife.com/wiki/Message_Queue_Evaluation_Notes

John Resig - JavaScript Testing Does Not Scale
http://ejohn.org/blog/javascript-testing-does-not-scale/

TestSwarm is still a work in progress but I hope to open up an alpha test by the end of this month. Its construction is very simple. It's a dumb JavaScript client that continually pings a central server looking for more tests to run. The server collects test suites and sends them out to the respective clients.
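
The pull model described above (a simple client that keeps asking a central server for tests, and a server that queues suites and collects results) can be sketched in a few lines of Python. This is a minimal in-process illustration, not TestSwarm's actual code; `SwarmServer`, `next_job`, `report`, and `run_client` are hypothetical names invented for this example.

```python
from collections import deque


class SwarmServer:
    """Hypothetical central server: queues test suites, collects results."""

    def __init__(self, suites):
        self.pending = deque(suites)   # suites waiting for a client
        self.results = {}              # suite name -> (client, passed)

    def next_job(self):
        """Called by a polling client; returns a suite, or None when idle."""
        return self.pending.popleft() if self.pending else None

    def report(self, client, suite, passed):
        """Record the outcome a client sends back for one suite."""
        self.results[suite] = (client, passed)


def run_client(name, server, run_suite):
    """A 'dumb' client: keep asking for work until the server has none."""
    completed = 0
    while True:
        suite = server.next_job()
        if suite is None:
            break  # a real client would sleep and poll again
        server.report(name, suite, run_suite(suite))
        completed += 1
    return completed


server = SwarmServer(["core", "ajax", "css"])
# The lambda stands in for actually running a suite in a browser.
done = run_client("firefox-3.5", server, run_suite=lambda suite: suite != "css")
```

Because the server only answers requests, adding another browser is just a matter of starting another client loop.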

Some Notes on Distributed Key Stores « random($foo)
http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores

Distributed Key Stores

(Anti RDBMS) Key-value stores

Jedi/Sector One's random thoughts - An overview of modern SQL-free databases
http://00f.net/2009/an-overview-of-modern-sql-free-databases

distributed systems primer :: snax
http://blog.evanweaver.com/articles/2009/05/04/distributed-systems-primer/

I've been reading a bunch of papers about distributed systems recently, in order to help systematize for myself the thing that we built over the last year. Many of them were originally passed to me by Toby DiPasquale. Here is an annotated list so everyone can benefit. It helps if you have some algorithms literacy, or have built a system at scale, but don't let that stop you.

Performance comparison: key/value stores for language model counts - Brendan O'Connor's Blog
http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/

"Tokyo Tyrant is a server implemented on top of Cabinet that implements a similar key/value API except over sockets. It’s incredibly flexible; it was very easy to run it in several different configurations. The first one is to use an in-memory data store, and communicate using the memcached protocol. This is, of course, *exactly* comparable to Memcached — behaviorally indistinguishable! — and it does worse. The second option is to do that, except switch to an on-disk data store. It’s pretty ridiculous that that’s still the same speed — communication overhead is completely dominating the time. Fortunately, Tyrant comes with a binary protocol. Using that substantially improves performance past Memcached levels, though less than a direct in-process database. Yes, communication across processes incurs overhead. No news here, I guess."

celery - Distributed Task Queue for Django. — Celery v0.3.5 (unstable) documentation
http://ask.github.com/celery/introduction.html

celery is a distributed task queue framework for Django. It is used for executing tasks asynchronously, routed to one or more worker servers, running concurrently using multiprocessing.

Project Voldemort Blog : Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort
http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/

Not one of those "we're using hadoop, now we're cool" articles. Well written!

Hadoop

erikfrey's bashreduce at master - GitHub
http://github.com/erikfrey/bashreduce/tree/master

whoah, wtf.

Map/Reduce in a bash script... hahahahahahaha

MapReduce done in BASH! Awesome!

Some mad bash magic for distributing stuff.

interesting hack -- apply Map-Reduce idioms to UNIX command lines across multiple machines or cores (via jzawodny, who's obviously looking at a lot of command line stuff recently ;)

braindump: NOSQL debrief
http://blog.oskarsson.nu/2009/06/nosql-debrief.html

First ever meeting of the NoSQL community. Lists all the presentations that were given.

Test Swarm
http://testswarm.com/

Distributed Continuous Integration for JavaScript

TestSwarm is a way for distributing JavaScript test suites to many browsers on many platforms - so you can get your results in a distributed manner.

up and running with cassandra :: snax
http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/

Cassandra is a hybrid non-relational database in the same class as Google's BigTable. It is more featureful than a key/value store like Dynomite, but supports fewer query types than a document store like MongoDB. Cassandra was started by Facebook and later transferred to the open-source community. It is an ideal runtime database for web-scale domains like social networks.

Facebook, Hadoop, and Hive | DBMS2 -- DataBase Management System Services
http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/

Just wanted to add that even though there is a single point of failure the reliability due to software bugs has not been an issue and the dfs Namenode has been very stable. The Jobtracker crashes that we have seen are due to errant jobs - job isolation is not yet that great in hadoop and a bad query from a user can bring down the tracker (though the recovery time for the tracker is literally a few minutes). There is some good work happening in the community though to address those issues.

A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.

HadoopDB Project
http://db.cs.yale.edu/hadoopdb/hadoopdb.html

An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.

HadoopDB is:
1. A hybrid of DBMS and MapReduce technologies that targets analytical workloads
2. Designed to run on a shared-nothing cluster of commodity machines, or in the cloud
3. An attempt to fill the gap in the market for a free and open source parallel DBMS
4. Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems.
5. As scalable as Hadoop, while achieving superior performance on structured data analysis workloads

DBMS Musings: Announcing release of HadoopDB (longer version)
http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html

my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.

Gearman
http://gearman.org/

# Reverse Worker Code
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction("reverse", "my_reverse_function");
while ($worker->work());

function my_reverse_function($job)
{
    return strrev($job->workload());
}

Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.

language independent worker framework

Last.fm – the Blog · Mapreduce Bash Script
http://blog.last.fm/2009/04/06/mapreduce-bash-script

One night at the pub we discussed whether one could replace Hadoop (a massive and comprehensive implementation of Mapreduce) with a single bash script, an awk command, sort, and a sprinkling of netcat. This turned into a weekend project dubbed bashreduce.

Hardcoded version of push

Map-Reduce implemented as a bash script!

MapReduce in a Bash Script

Map Reduce implemented in bash using sort, awk, grep, join.
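
The general shape of such a pipeline (partition the input by key, sort each partition, then merge and aggregate) can be sketched in Python. This only illustrates the idea and is not bashreduce's code: the function names are made up for this example, and everything runs in one process instead of being spread across machines with netcat.

```python
import heapq
from itertools import groupby


def key_of(line):
    """The key is the first whitespace-separated field of a line."""
    return line.split()[0]


def partition(lines, n_workers):
    """Route each line to a worker bucket based on its key."""
    buckets = [[] for _ in range(n_workers)]
    for line in lines:
        # Sum of byte values is a stable stand-in for a real hash function.
        buckets[sum(key_of(line).encode()) % n_workers].append(line)
    return buckets


def map_sort(bucket):
    """Per-worker step, analogous to running `sort` on each machine."""
    return sorted(bucket, key=key_of)


def reduce_counts(sorted_runs):
    """Merge the sorted runs and count lines per key."""
    merged = heapq.merge(*sorted_runs, key=key_of)
    return {key: sum(1 for _ in group) for key, group in groupby(merged, key=key_of)}


lines = ["b x", "a y", "b z", "c q", "a w", "b r"]
runs = [map_sort(bucket) for bucket in partition(lines, n_workers=2)]
counts = reduce_counts(runs)
```

Sorting inside each partition is what lets the final step be a cheap streaming merge rather than a second full sort.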

Riak - A Decentralized Database
http://riak.basho.com/

Riak combines a decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.

GFS: Evolution on Fast-forward - ACM Queue
http://queue.acm.org/detail.cfm?id=1594206

Google File System

ACM Queue, August 7, 2009

Is a Perfect Storm Forming For Distributed Social Networking?
http://www.readwriteweb.com/archives/is_a_perfect_storm_forming_for_distributed_social_networking.php

Is the time right for a distributed net, as Dave Winer suggests? Links to various tools that could be really useful.

"Maybe it's better to host your own. That's the thinking coming from a growing number of early technology adopters as service after service goes down, sells out or otherwise frustrates the users who have published their content online only to see the tools they use become broken or less desirable." One word (camelCased): BuddyPress.

XtreemFS - file systems for the masses - a replicated and distributed file system for the internet and cloud storage
http://www.xtreemfs.org/

WTF is a SuperColumn? An Intro to the Cassandra Data Model — Arin Sarkissian
http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model

Nice detailed examples on NoSQL data modeling in Cassandra.

Home - testswarm - GitHub
http://wiki.github.com/jeresig/testswarm

TestSwarm provides distributed continuous integration testing for JavaScript. It was initially created by John Resig as a tool to support the jQuery project and has since moved to become an official Mozilla Labs project.

Javascript testing framework.

InfoQ: Clojure and Rails - the Secret Sauce Behind FlightCaster
http://www.infoq.com/articles/flightcaster-clojure-rails

Clojure is a LISP for the JVM created by Rich Hickey.

FlightCaster, a realtime flight delay site, is built on Clojure and Hadoop for the statistical analysis. The web frontend is built with Ruby on Rails and hosted on Heroku. We talked to Bradford Cross about Clojure, functional programming and tips for OOP developers interested in making the jump.

Another critical piece of infrastructure is Cascading; an excellent layer on top of Hadoop that adds additional abstraction and functionality. We definitely recommend Cascading to anyone doing serious data processing and mining with Hadoop.

swarm-dpl - Project Hosting on Google Code
http://code.google.com/p/swarm-dpl/

Swarm is a framework allowing the creation of web applications which can scale transparently through a novel portable continuation-based approach. Swarm embodies the maxim "move the computation, not the data".

The Anatomy of Hadoop I/O Pipeline (Hadoop and Distributed Computing at Yahoo!)
http://developer.yahoo.net/blogs/hadoop/2009/08/the_anatomy_of_hadoop_io_pipel.html

NoSQL: Distributed and Scalable Non-Relational Database Systems | Linux Magazine
http://www.linux-mag.com/cache/7579/1.html

From @jesserobbins

Non-SQL oriented distributed databases are all the rage in some circles. They’re designed to scale from day 1 and offer reliability in the face of failures.

Cassandra and Ruby: A Love Affair? | Engine Yard Blog
http://www.engineyard.com/blog/2009/cassandra-and-ruby-a-love-affair/

"Most of today’s up and coming key-value stores are more than just simple key-value stores. You saw this when we looked at Tokyo Cabinet which, in addition to simple key-value capabilities, adds more sophisticated abilities, such as database-like tables. In this post we’ll look at Cassandra — a modern key-value store that continues this trend. Cassandra was originally developed by Facebook and released to open source last year. The Facebook team describes Cassandra as (Google) BigTable running on top of an Amazon Dynamo-like infrastructure."

Most of today's up-and-coming key-value stores are more than just simple key-value stores. Cassandra is a modern key-value store that continues this trend.

Why I like Redis
http://simonwillison.net/2009/Oct/22/redis/

Like mongodb but lives in memory with replication and periodic store-to-disk. Like memcached but with data structures. Great for non-critical data or replicated critical data.

Scaling Memcached: 500,000+ Operations/Second with a Single-Socket UltraSPARC T2 - Parallelism on the Brain
http://blogs.sun.com/zoran/entry/scaling_memcached_500_000_ops

A software-based distributed caching system such as memcached is an important piece of today's largest Internet sites that support millions of concurrent users and deliver user-friendly response times. The distributed nature of memcached design transforms 1000s of servers into one large caching pool with gigabytes of memory per node. This blog entry explores single-instance memcached scalability for a few usage patterns.
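
One way many servers can act as a single pool is for the client to pick a server per key by hashing. The sketch below illustrates that idea with plain dictionaries standing in for servers; `CachePool` and its methods are hypothetical names, and simple modulo hashing is used here for brevity (it is an assumption of this example, not a description of any particular memcached client).

```python
import zlib


class CachePool:
    """Toy pool: each 'server' is a dict; the client picks one per key."""

    def __init__(self, server_names):
        self.server_names = list(server_names)
        self.servers = {name: {} for name in self.server_names}

    def server_for(self, key):
        """Map a key to one server name using a CRC32 hash of the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.server_names)
        return self.server_names[index]

    def set(self, key, value):
        self.servers[self.server_for(key)][key] = value

    def get(self, key):
        # A miss returns None, as a cache lookup would.
        return self.servers[self.server_for(key)].get(key)


pool = CachePool(["cache1", "cache2", "cache3"])
pool.set("user:42", "alice")
```

With modulo hashing, changing the number of servers remaps most keys, which is why real deployments often prefer consistent hashing.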

Schneier on Security: Self-Enforcing Protocols
http://www.schneier.com/blog/archives/2009/08/self-enforcing.html

Notes on methods to eliminate corruption in a system by making honesty the most advantageous course of action

"Here’s a self-enforcing protocol for determining property tax: the homeowner decides the value of the property and calculates the resultant tax, and the government can either accept the tax or buy the home for that price. Sounds unrealistic, but the Greek government implemented exactly that system for the taxation of antiquities. It was the easiest way to motivate people to accurately report the value of antiquities."

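The incentive in the property tax example can be checked with a small calculation. This is a sketch under loud assumptions: the government is assumed to know the true value, the 1% tax rate is invented for illustration, and `settle` is a hypothetical helper, not anything from the article.

```python
def settle(declared_value, true_value, tax_rate=0.01):
    """Return the owner's outcome under the declare-or-sell rule.

    Assumption: the government buys at the declared price whenever that
    price is below the true value; otherwise it accepts the tax computed
    from the declaration.
    """
    if declared_value < true_value:
        # Owner is bought out and loses the undervalued difference.
        return {"action": "buy", "owner_loss": true_value - declared_value}
    # Owner keeps the property and pays tax on what they declared.
    return {"action": "tax", "owner_loss": declared_value * tax_rate}


honest = settle(declared_value=200_000, true_value=200_000)
low = settle(declared_value=120_000, true_value=200_000)
high = settle(declared_value=300_000, true_value=200_000)
```

Under these assumptions, declaring too low forfeits the property at a discount and declaring too high raises the tax bill, so the honest declaration costs the owner least.
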
Rackspace Cloud Computing & Hosting | NoSQL Ecosystem
http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/

Good introduction to the "NoSQL" space (initially not a fan of the term, but I guess it is going to stick...), highlighting the different designs used by the options in the space, and the benefits/drawbacks of those designs.

Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as “NoSQL databases.”

Jonathan Ellis's Programming Blog - Spyced: CouchDB: not drinking the kool-aid
http://spyced.blogspot.com/2008/12/couchdb-not-drinking-kool-aid.html

Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages. It's a little verbose, and you might be bored with it, but it's much better than writing low-level mapreduce code.

Pragmatic Programming Techniques: NOSQL Patterns
http://horicky.blogspot.com/2009/11/nosql-patterns.html

A nice overview of some of the more popular patterns in NoSQL architecture

BERT and BERT-RPC 1.0 Specification
http://bert-rpc.org/

BERT and BERT-RPC are an attempt to specify a flexible binary serialization and RPC protocol that are compatible with the philosophies of dynamic languages such as Ruby, Python, PERL, JavaScript, Erlang, Lua, etc. BERT aims to be as simple as possible while maintaining support for the advanced data types we have come to know and love. BERT-RPC is designed to work seamlessly within a dynamic/agile development workflow. The BERT-RPC philosophy is to eliminate extraneous type checking, IDL specification, and code generation. This frees the developer to actually get things done.

Hadoop - YDN
http://developer.yahoo.com/hadoop/

"Apache Hadoop* is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware."

Hadoop and Distributed Computing at Yahoo!

hazelcast - Project Hosting on Google Code
http://code.google.com/p/hazelcast/

Hazelcast is a clustering and highly scalable data distribution platform for Java. Features:
* Distributed implementations of java.util.{Queue, Set, List, Map}
* Distributed implementation of java.util.concurrent.locks.Lock
* Distributed implementation of java.util.concurrent.ExecutorService
* Distributed MultiMap for one-to-many relationships
* Distributed Topic for publish/subscribe messaging
* Transaction support and J2EE container integration via JCA
* Socket level encryption support for secure clusters
* Synchronous (write-through) and asynchronous (write-behind) persistence
* Second level cache provider for Hibernate
* Monitoring and management of the cluster via JMX
* Dynamic HTTP session clustering
* Support for cluster info and membership events
* Dynamic discovery
* Dynamic scaling
* Dynamic partitioning with backups
* Dynamic fail-over
Hazelcast is for you if you want to
* share data/state among many s

data distribution platform

Distributed Logging: Syslog-ng & Splunk - igvita.com
http://www.igvita.com/2008/10/22/distributed-logging-syslog-ng-splunk/

stream live logs from your Ruby, Haproxy, and Nginx processes into your Splunk database for easy debugging and profiling. Of course, same procedures apply to any other process on a remote server - make it log to syslog, and you can route it to Splunk!

Introducing Redis: a fast key-value database | Zen and the Art of Programming
http://antoniocangiano.com/2009/03/11/introducing-redis-a-key-value-database/

Product: Scribe - Facebook's Scalable Logging System | High Scalability
http://highscalability.com/product-scribe-facebooks-scalable-logging-system

assertTrue( ): NoSQL Required Reading
http://asserttrue.blogspot.com/2009/12/nosql-required-reading.html

Starting from Dynamo, ending with (roughly) follow @nosqlupdate on Twitter.

Materials that you need to read in order to get started with NoSQL

List of resources to read to get up-to-speed on the NoSQL movement.

Cookpad and Hadoop « Cookpad Developer Blog
http://techlife.cookpad.com/2009/09/16/cookpad-hadoop-introduction/

Easy-to-understand material.

Hello, I'm Katsuma from the "Sagasu" (search) team; I joined the company this May. There has been a lot of hard work since I joined, but recently the drinkers among us have even started a "Katsuma-kai" that meets for drinks from Monday on, and my days are packed both at work and outside it! As a member of the "Sagasu" team, I usually work on various developments around Cookpad's search in order to raise the satisfaction of users searching for recipes...

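Another form of database: what is a distributed key-value store? (1/3) - @IT
http://www.atmarkit.co.jp/fjava/rensai4/bigtable01/01.html

An explanation of key-value stores. The CAP theorem shows that a distributed system cannot guarantee all three of the following at the same time:
* Data consistency (Consistency)
* Data availability (Availability)
* Data partitioning (Partition-tolerance)

>"Distributed key-value stores" are attracting attention as the database of the cloud era, distinct from the RDB. A thorough explanation of "Bigtable", the foundation technology behind many Google services and arguably the front-runner among them. Hmm, I'm not so sure…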

Bigtable, SimpleDB, Tokyo Tyrant

Google Technology RoundTable: Map Reduce
http://research.google.com/roundtable/MR.html

Matt is also the author of

Consensus Protocols: Two-Phase Commit at Paper Trail
http://hnr.dnsalias.net/wordpress/?p=90

Nice article on 2pc

terrastore - Project Hosting on Google Code
http://code.google.com/p/terrastore/

octo.py: quick and easy MapReduce for Python
http://ebiquity.umbc.edu/blogger/2009/01/02/octopy-quick-and-easy-mapreduce-for-python/

showcases an example of using the mapreduce system octo.py

Bytepawn - Scalable Web Architectures and Application State
http://bytepawn.com/2009/06/17/scalable-web-architectures-and-application-state/

Note about Code-State-Cache-Data (CSCD) pattern in scalable web applications.

Short article advocating a "Code-State-Cache-Data" (CSCD) architecture instead of just CD or CCD applications. Basically says that you should forget about stateful apps if you want maximum performance...

Application state - Data you can restore from the database or afford to lose if server is restarted (logged in users). He recommends storing this in-memory. "Application state goes into an in-memory key-value store like Tokyo Tyrant. Cache data goes into Memcached. Persistent data goes into a database"

"What he needs is the insight to identify state, cached data and persistent data in his application. Application state goes into an in-memory key-value store like Tokyo Tyrant. Cache data goes into Memcached. Persistent data goes into a database. Note that the separation of code and application state may be beneficial later, because it allows you to scale easily by adding new memory servers. ... Let's call this the Code-State-Cache-Data (CSCD) pattern. What Damian originally had was a Code-Data (CD) pattern, and later he optimized to get a Code-Cache-Data (CCD) pattern"

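The three-way split quoted above can be illustrated with a toy model in which plain dictionaries stand in for the in-memory state store, the cache, and the database. This is only a sketch of the pattern, not the article's code; `CSCDStore` and its methods are hypothetical names.

```python
class CSCDStore:
    """Toy model of the Code-State-Cache-Data split using plain dicts."""

    def __init__(self):
        self.state = {}     # application state (sessions); acceptable to lose
        self.cache = {}     # cached copies of persistent data
        self.database = {}  # persistent system of record
        self.db_reads = 0   # counts how often the slow tier is hit

    def login(self, session_id, user_id):
        self.state[session_id] = user_id

    def save_profile(self, user_id, profile):
        self.database[user_id] = profile
        self.cache.pop(user_id, None)  # invalidate the stale cached copy

    def load_profile(self, session_id):
        user_id = self.state.get(session_id)
        if user_id is None:
            return None  # unknown session: the user must log in again
        if user_id not in self.cache:
            self.db_reads += 1
            self.cache[user_id] = self.database[user_id]
        return self.cache[user_id]


store = CSCDStore()
store.save_profile(1, {"name": "Damian"})
store.login("s1", 1)
first = store.load_profile("s1")   # read goes to the database, fills the cache
second = store.load_profile("s1")  # read is served from the cache
```

Keeping the three kinds of data in separate stores is what allows each tier to be scaled or restarted independently.
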
Elastic Search - Open Source, Distributed, RESTful Search Engine
http://www.elasticsearch.com/

A distributed, highly available RESTful Search Engine based on Lucene

ElasticSearch is an open source, distributed, RESTful search engine built on top of Lucene.

ElasticSearch is an open source, distributed, RESTful, Search Engine.

An open source search engine built on top of Lucene that can be distributed across multiple indexes on different nodes.

Beanstalkd / Python Basic Tutorial - Standard Deviations
http://parand.com/say/index.php/2008/10/12/beanstalkd-python-basic-tutorial/

Beanstalkd is an in-memory queuing system. It supports named queues (called ‘tubes’), priorities, and delayed delivery of messages. Terminology: a message is called a job, and queues are called tubes

c = serverconn.ServerConn('localhost', 99988)
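
The three features named above (tubes, priorities, delayed delivery) can be modeled in memory to show how they interact. This is a toy sketch, not the beanstalkd client API: `TubeQueue`, `put`, and `reserve` are hypothetical names, the default priority of 1024 is arbitrary, and time is passed in explicitly as `now` so the behavior is deterministic. It follows the convention that a lower priority number is more urgent.

```python
import heapq
import itertools


class TubeQueue:
    """In-memory toy with named tubes, priorities, and delayed jobs."""

    def __init__(self):
        self.tubes = {}                   # tube name -> heap of jobs
        self.counter = itertools.count()  # tie-breaker keeps FIFO order

    def put(self, tube, body, priority=1024, delay=0, now=0):
        """Enqueue a job; lower priority numbers are served first."""
        job = (priority, next(self.counter), now + delay, body)
        heapq.heappush(self.tubes.setdefault(tube, []), job)

    def reserve(self, tube, now=0):
        """Return the most urgent job whose delay has expired, or None."""
        heap = self.tubes.get(tube, [])
        ready = [job for job in heap if job[2] <= now]
        if not ready:
            return None
        job = min(ready)
        heap.remove(job)
        heapq.heapify(heap)  # restore the heap after removing from the middle
        return job[3]


q = TubeQueue()
q.put("emails", "low-priority", priority=2000)
q.put("emails", "urgent", priority=0)
q.put("emails", "later", priority=0, delay=30)
first = q.reserve("emails", now=0)
second = q.reserve("emails", now=0)
third = q.reserve("emails", now=0)
fourth = q.reserve("emails", now=30)
```

The delayed job is invisible to `reserve` until its ready time has passed, even though its priority is the most urgent.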

Hadoop Live CD at OpenSolaris.org
http://opensolaris.org/os/project/livehadoop/

OpenSolaris Project: Hadoop Live CD

A Comparison of Approaches to Large-Scale Data Analysis - MapReduce vs. DBMS Benchmarks
http://database.cs.brown.edu/sigmod09/

"The following information is meant to provide documentation on how others can recreate the benchmark trials used in our SIGMOD 2009 paper."

A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

PiCloud | Cloud Computing. Simplified.
http://www.picloud.com/

import a lib in your python and run code automagically in parallel on a remote cluster. pricing is time used x processes plus data xfer

Cassandra @ Twitter: An Interview with Ryan King « MyNoSQL
http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king

RT @kvz: Why Twitter is dropping MySQL in favor of Cassandra: http://bit.ly/dyeiXF

RT @DZone "Cassandra @ Twitter: An Interview with Ryan King « MyNoSQL" http://dzone.com/WbTY

MyNoSQL: Please include anything I’ve missed.

The Apache Cassandra Project
http://cassandra.apache.org/

A massively parallel database in the spirit of "Bigtable"; it originated at Facebook.

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model.

Akka Project
http://akkasource.org/

Java Platform (AKA, JVM) library/framework for distributed, fault-tolerant systems through Actors. Scala and Java APIs. Found through Dean Wampler. Project tag line is "Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Actors"

Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Actors (Erlang with Java and Scala APIs)

Kazuho@Cybozu Labs: I have started building a distributed storage system named Pacific
http://developer.cybozu.co.jp/kazuho/2009/06/pacific-18c7.html

WTF is a SuperColumn? An Intro to the Cassandra Data Model — Arin Sarkissian
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model

Introductory blog post about the Cassandra data model.

Gojko Adzic » Improving performance and scalability with DDD
http://gojko.net/2009/06/23/improving-performance-and-scalability-with-ddd/

Distributed systems are not typically a place where domain driven design is applied. Distributed processing projects often start with an overall architecture vision and an idea about a processing model which basically drives the whole thing, including object design if it exists at all. Elaborate object designs are thought of as something that just gets in the way of distribution and performance, so the idea of spending time to apply DDD principles gets rejected in favour of raw throughput and processing power. However, from my experience, some more advanced DDD concepts can significantly improve performance, scalability and throughput of distributed systems when applied correctly.

One of the most important building blocks of DDD that can help in distributed systems are aggregates. Unfortunately, at least judging by the discussions that I’ve had with client teams over the last few years, aggregates seem to be one of the most underrated and underused building blocks of DDD. I’m probably as guilty as anyone else of misusing aggregates and it took me quite a while to grasp the full potential of that concept. But from this perspective I think that getting aggregates just right is key to making a distributed system work and perform as expected.

Geeking with Greg: Jeff Dean keynote at WSDM 2009
http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsdm-2009.html

Google Fellow Jeff Dean gave an excellent keynote talk at the recent WSDM 2009 conference that had tidbits on Google I had not heard before. Particularly impressive is Google's attention to detail on performance and their agility in deployment over the last decade.

CS264: Peer-to-Peer Systems
http://www.eecs.harvard.edu/~mema/courses/cs264/cs264.html#schedule

Coding Horror: On Working Remotely
http://www.codinghorror.com/blog/2010/05/on-working-remotely.html

What we did last week; what we're planning to do this week; anything that is blocking us or that we are concerned about.

HBase vs Cassandra: why we moved « Bits and Bytes.
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

The Twitter Engineering Blog: Introducing Gizzard, a framework for creating distributed datastores
http://engineering.twitter.com/2010/04/introducing-gizzard-framework-for.html

join diaspora
http://www.joindiaspora.com/

An open social network project. The future? Something to think about...

A possible alternative to Facebook

High Scalability - 7 Lessons Learned While Building Reddit to 270 Million Page Views a Month
http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html

README - redis - Google Code
http://code.google.com/p/redis/wiki/README

A database implementing a dictionary, where every key is associated with a value. Every single value has a type. The following types are supported:
* Strings
* Lists
* Sets
* Sorted Set (since version 1.1)
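To make the "typed dictionary" idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration, not how Redis is implemented: the class name `TypedStore` is hypothetical, and the method names merely echo the Redis commands SET, GET, RPUSH, SADD, ZADD, ZRANGE, and TYPE. It assumes that each key holds exactly one of the four value types listed above and that list, set, and sorted-set operations refuse to touch a key of another type.

```python
class TypedStore:
    """Hypothetical toy model of a typed key-value dictionary (not Redis itself)."""

    def __init__(self):
        # key -> str (string) | list (list) | set (set) | dict member->score (sorted set)
        self._data = {}

    def type_of(self, key):
        # Report the type stored under a key, or "none" if the key is absent.
        value = self._data.get(key)
        if value is None:
            return "none"
        names = {str: "string", list: "list", set: "set", dict: "zset"}
        return names[type(value)]

    def _container(self, key, expected, factory):
        # Create the container on first use; refuse to mix types under one key.
        value = self._data.get(key)
        if value is None:
            value = self._data[key] = factory()
        elif not isinstance(value, expected):
            raise TypeError(f"key {key!r} holds a {self.type_of(key)}")
        return value

    def set(self, key, value):
        # Storing a string replaces whatever the key held before.
        self._data[key] = str(value)

    def get(self, key):
        value = self._data.get(key)
        if value is not None and not isinstance(value, str):
            raise TypeError(f"key {key!r} holds a {self.type_of(key)}")
        return value

    def rpush(self, key, item):
        # Append to the tail of the list stored at key.
        self._container(key, list, list).append(item)

    def sadd(self, key, member):
        # Add a member to the set stored at key.
        self._container(key, set, set).add(member)

    def zadd(self, key, score, member):
        # Sorted set: each member carries a score that defines the ordering.
        self._container(key, dict, dict)[member] = score

    def zrange(self, key):
        # Return members ordered by score, ties broken by member name.
        scores = self._data.get(key, {})
        if not isinstance(scores, dict):
            raise TypeError(f"key {key!r} holds a {self.type_of(key)}")
        return sorted(scores, key=lambda member: (scores[member], member))
```

For example, after `store.zadd("board", 2.0, "bob")` and `store.zadd("board", 1.0, "alice")`, calling `store.zrange("board")` lists alice before bob, while `store.rpush("board", "x")` raises a `TypeError` because the key already holds a sorted set.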

Maybe the author is not the right person to make such a comparison?

Persistent in-memory key value database compared to memcached

structures and algorithms. Indeed, both the algorithms and the data structures in Redis are properly chosen in order to obtain the best performance.

ongoing · The Web vs. the Fallacies
http://www.tbray.org/ongoing/When/200x/2009/05/25/HTTP-and-the-Fallacies-of-Distributed-Computing

Here at Sun, the Fallacies of Distributed Computing have long been a much-revered lesson. Furthermore, I personally think they’re pretty much spot-on. But these days, you don’t often find them coming up in conversations about building big networked systems. The reason is, I think, that we build almost everything on Web technologies, which lets us get away with believing some of them.

via rtomayko

If you’re building Web technology, you have to worry about these things. But if you’re building applications on it, mostly you don’t. ¶ Well, except for security; please don’t stop worrying about security.

Data-Intensive Text Processing with MapReduce
http://www.umiacs.umd.edu/~jimmylin/book.html

Nicholas Piël » ZeroMQ an introduction
http://nichol.as/zeromq-an-introduction

ZeroMQ is a messaging library, which allows you to design a complex communication system without much effort.

ZeroMQ is a messaging library, which allows you to design a complex communication system without much effort. It has been wrestling with how to effectively describe itself in recent years. In the beginning it was introduced as ‘messaging middleware’; later they moved to ‘TCP on steroids’, and right now it is a ‘new layer on the networking stack’. I had some trouble understanding ZeroMQ at first and really had to reset my brain. First of all, it is not a complete messaging system such as RabbitMQ or ActiveMQ. I know the guys of Linden Research compared them, but it is apples and oranges. A full-fledged messaging system gives you an out-of-the-box experience. Unwrap it, configure it, start it up and you’re good to go once you have figured out all its complexities. ZeroMQ is not such a system at all; it is a simple messaging library to be used programmatically. It basically gives you a pimped socket interface allowing you to quickly build your own messaging system.

A communications library

#ZeroMQ an introduction - http://goo.gl/Za3t #python #messaging

Bitcoin P2P Cryptocurrency | Bitcoin
http://www.bitcoin.org/

RT @draenews: Del Bitcoin P2P Cryptocurrency | Bitcoin: http://www.bitcoin.org/

Bitcoin is a peer-to-peer network based digital currency. Peer-to-peer (P2P) means that there is no central authority to issue new money or keep track of transactions. Instead, these tasks are managed collectively by the nodes of the network.

Hmm, a P2P, encryption-based, free online banking service; looks very insecure.

Understanding and Applying Operational Transformation - Code Commit
http://www.codecommit.com/blog/java/understanding-and-applying-operational-transformation?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+codecommit+%28Code+Commit%29

@djspiewak wrote a very detailed intro to operational transformation. Very useful for building, say, a collab editor

Almost exactly a year ago, Google made one of the most remarkable press releases in the Web 2.0 era. Of course, by “press release”, I actually mean keynote at their own conference, and by “remarkable” I mean potentially-transformative and groundbreaking. I am referring of course to the announcement of Google Wave, a real-time collaboration tool which has been in open beta for the last several months.

Good article explaining how the Operational Transform from Google Wave can be implemented, and the various cases that have to be handled when server and client both have edits pending.
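The case where server and client both have edits pending can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the algorithm used by Google Wave or the one in the linked article: it assumes plain-string documents, insert-only operations, and a hypothetical `site` label used purely to break ties when both sides insert at the same position. The names `Insert`, `apply_op`, and `transform` are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text is inserted
    text: str   # text to insert
    site: str   # hypothetical origin label, used only to break ties


def apply_op(doc, op):
    # Apply a single insert to a plain-string document.
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op, other):
    """Rewrite `op` so it can be applied after the concurrent insert `other`.

    If `other` landed before `op` (or at the same position and wins the
    tie-break), `op` must shift right by the length of the inserted text.
    """
    other_goes_first = other.pos < op.pos or (
        other.pos == op.pos and other.site < op.site
    )
    if other_goes_first:
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


def converge(doc, client_op, server_op):
    # Both sides apply their own edit first, then the transformed remote edit.
    client_view = apply_op(apply_op(doc, client_op), transform(server_op, client_op))
    server_view = apply_op(apply_op(doc, server_op), transform(client_op, server_op))
    return client_view, server_view
```

Calling `converge("abc", Insert(1, "X", "client"), Insert(1, "Y", "server"))` yields the same string on both sides, which is the property a collaborative editor relies on. A real implementation would also need to cover deletes, sequences of pending operations, and acknowledgement from the server, all of which this sketch leaves out.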

The algorithm behind "Wave"