Thursday, January 15, 2009

Distributed File Systems - for Virtualization & Clouds

To distribute workload and processing uniformly across virtualized systems, one needs a different kind of file system, such as Hadoop's HDFS or pNFS (Parallel NFS).
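To make the idea of spreading a file's storage across machines concrete, here is a minimal sketch in Python. It is only an illustration, not how HDFS or pNFS actually place data: the block size, the replica count, the node names and the simple round-robin policy are all assumptions chosen for the example.

```python
from collections import Counter

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB), for illustration only
REPLICATION = 3                # assumed number of copies kept of each block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [nodes holding a replica of that block])."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = []
    for block in range(num_blocks):
        # Hypothetical round-robin policy: consecutive blocks start on
        # consecutive nodes, and each replica goes to the next node in the ring.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placement.append((block, replicas))
    return placement


def load_per_node(placement):
    """Count how many block replicas each node ends up storing."""
    return Counter(node for _, replicas in placement for node in replicas)


nodes = ["node-%d" % i for i in range(4)]  # hypothetical node names
layout = place_blocks(file_size=1024 * 1024 * 1024, nodes=nodes)  # a 1 GB file
print(load_per_node(layout))
```

With these assumed numbers the 1 GB file becomes 16 blocks, and each of the four nodes stores the same number of replicas, so reads and processing can be spread evenly instead of hitting one storage server.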

Hadoop is an Apache Java project that supports data-intensive distributed applications. Note that it is suited only to offline (batch) processing, such as log analysis, search indexing, business intelligence or ad placement. It enables applications to work with thousands of nodes and petabytes of data. The current tested implementations work across 4000 nodes; the plan is to make it work for about 15,000 nodes.
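As a rough picture of what "offline processing of logs" means, the sketch below counts hits per URL over a log that has already been collected, using the map/reduce style that Hadoop jobs follow. It is plain single-machine Python, not Hadoop's API; the log format, the sample lines and the function names are all made up for the example.

```python
from collections import defaultdict

# Hypothetical access-log lines in the form "<timestamp> <status> <url>"
LOG_LINES = [
    "2009-01-15T10:00:01 200 /index.html",
    "2009-01-15T10:00:02 404 /missing.html",
    "2009-01-15T10:00:03 200 /index.html",
    "2009-01-15T10:00:04 200 /about.html",
]


def map_line(line):
    """Map step: emit one (url, 1) pair for each log line."""
    _timestamp, _status, url = line.split()
    yield url, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each url."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)


def run_batch(lines):
    """Process the complete, already-collected log in one pass."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


hits = run_batch(LOG_LINES)
print(hits)
```

The job only produces an answer after it has read the whole input, which is why this style fits log crunching and reporting rather than serving live requests.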

The most notable users of Hadoop are Facebook, Yahoo, Joost, Google & Veoh. I will try to list various uses for Hadoop some other day!