Overview - A Distributed Operating System

A distributed operating system is a system that operates a distributed collection of computers: a cluster. Some designs require a single controller, but that shouldn't be necessary and isn't an interesting part of the problem. The system runs programs and manages resources, and it should completely abstract the view of the hardware away from the software.


Fault tolerance, scalability and redundancy

The cluster may be heterogeneous, nodes may go offline, and new nodes may come online. These are all situations the system should handle: it should dynamically adapt to changes in its configuration while continuing to function without error. At all times the system should utilize the cluster to its full potential. Nothing of value should be lost when a node does go down, so redundancy needs to be built into the system.

Abstraction

For a distributed operating system, abstraction of the whole computer cluster needs to be considered. The application programmer needs the ability to create threads, manage the communication between them and access resources. The actual distribution and synchronisation should be managed by the system.

Intro to components

Compute power distribution

Distributing computational tasks between computers can be difficult or easy depending on the type of task being performed.

Some tasks fall into the embarrassingly parallel category: there are many data items that each need the same processing. Here the data can simply be partitioned and sent to different CPUs to work on independently, as in the sketch below.
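
As a minimal sketch of this pattern, here is the same idea on a single machine, with Python's multiprocessing pool standing in for the cluster's CPUs:

    from multiprocessing import Pool

    def process(item):
        # The same processing is applied independently to every data item.
        return item * item

    if __name__ == "__main__":
        data = range(1000)
        with Pool() as pool:                    # one worker per CPU by default
            results = pool.map(process, data)   # partition, distribute, collect
        print(sum(results))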

Other tasks fall into the non-embarrassingly parallel category. These tasks require more complex coordination to work in parallel; traditionally they use shared variables and require locking.

In both cases there are multiple threads of execution working for some process. These threads will be distributed over different CPUs and computer nodes within the cluster.
Requirements
The distributed operating system needs to abstract the hardware and run the programs. While some overhead will be accepted, the efficiency of the system is a priority. Hardware abstraction has been done very well in modern computing: there are programming languages like C that compile for many different machine architectures, and languages like Java that run in a virtual machine environment.

Resource management

Some sort of shared storage is required. Depending on the application, a traditional file system or a database might be needed. These would use the storage of all the nodes while maintaining redundancy, and should behave like traditional implementations.

Compute power distribution

I propose that processes and threads get a context that abstracts and simplifies their distribution and fault tolerance. These contexts would be similar to the thread/process contexts on a single-computer operating system, with fault-tolerance and scalability extensions. Contexts would define inter-thread communication and be associated with shared variables and thread locks. A global context would exist for the entire process, and smaller context groups could be defined for sub-groups of threads within it. This partitioning of threads will help the system optimize the layout of the nodes and limit the scope of shared resources, reducing the number of redundant copies that must be updated. A sketch of what such an API might look like follows.
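
This is a hypothetical sketch only; the names (Context, group, shared, spawn) are illustrative and not an existing library:

    class Context:
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.shared_vars = {}   # shared variables scoped to this group
            self.threads = []       # threads belonging to this group

        def group(self, name):
            # A smaller context group for a subset of threads; the system
            # could use this grouping to place those threads close together.
            return Context(name, parent=self)

        def shared(self, name, initial):
            # Declare a shared variable scoped to this context group, so
            # only this group's nodes need redundant copies of it.
            self.shared_vars[name] = initial
            return name

        def spawn(self, fn, *args):
            # Register a thread for the system to place, start, checkpoint
            # and restart; the actual distribution logic is omitted here.
            self.threads.append((fn, args))

    # One global context for the process, plus a smaller group for a
    # sub-problem whose threads should sit near each other.
    world = Context("process-global")
    solver = world.group("sub-problem")
    total = solver.shared("total", 0)
    solver.spawn(lambda: None)
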
Inter-thread communication and shared variables
It may be that a small group of threads needs to work on a sub-problem of a larger task, and the parallelization needs shared variables. Normally this would require a lot of inter-thread communication, but through these contexts that could be abstracted away: the threads would simply access the shared variable. The computer nodes these threads run on should have minimal latency between them, and by being in their own context group the system would know to put them close together. This is an example of an environment-aware optimization.

Sometimes threads may need to wait for other threads. These waits could also be managed by the thread context groups.
Fault tolerance
The thread contexts would ensure fault tolerance by saving their state to some distributed memory, and by confirming that the thread is still running with 'heartbeat' messages. If the thread is found to have crashed, another thread could be started from the last context save. The only thing of value lost is the computation time since that save. A minimal sketch follows.
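
Here is a single-machine sketch of this save-and-heartbeat loop; the dictionary store, the simulated crash and the polling intervals are assumptions of the example:

    import threading, time, random

    checkpoint_store = {}   # stand-in for the distributed shared memory

    def worker(thread_id):
        # Resume from the last context save, or start fresh.
        state = checkpoint_store.get(thread_id, {"progress": 0})
        while state["progress"] < 10:
            state["progress"] += 1                      # one unit of real work
            checkpoint_store[thread_id] = dict(state)   # the context save
            if random.random() < 0.1:
                return                                  # simulated crash
            time.sleep(0.05)

    def supervise(thread_id):
        while checkpoint_store.get(thread_id, {"progress": 0})["progress"] < 10:
            t = threading.Thread(target=worker, args=(thread_id,))
            t.start()
            while t.is_alive():     # the 'heartbeat': is it still running?
                time.sleep(0.1)
            # If the loop condition still holds, the worker crashed and a
            # replacement is started from its last context save; only the
            # computation time since that save is lost.

    supervise("t1")
    print(checkpoint_store["t1"])   # {'progress': 10}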

In traditional operating systems, threads are swapped out to allow multiple threads to run on a single CPU. This is similar to the context saves in this distributed system; the same kind of information is being saved.

The context saves could be an expensive part of the distribution of tasks. Each save comes with a cost, but each crash comes with a cost that the context save lessens. Balancing the two could be quite difficult, especially across different cluster configurations, so the distributed operating system should take care of how often context saves occur. By learning the cost of each save and how failure-prone the cluster is, an automated balancing algorithm could be implemented.
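
One well-known starting point for such a balancing algorithm is Young's approximation for the optimal checkpoint interval; a sketch, assuming the system has measured the cost of a save and the cluster's mean time between failures:

    import math

    def checkpoint_interval(save_cost_s, mtbf_s):
        # Young's approximation: the interval that balances the overhead of
        # saving against the expected recomputation lost to a crash.
        return math.sqrt(2.0 * save_cost_s * mtbf_s)

    # e.g. a 2 s context save on a cluster averaging one failure per day:
    print(checkpoint_interval(2.0, 24 * 3600))   # ~588 s between saves
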
Locks
Locks should generally be avoided, especially in a redundant distributed context. Many locks can be avoided through the use of higher-level data structures. For example, in a producer-consumer situation a thread-safe queue may be simpler to work with than a plain array: the array is much simpler to implement, but isn't thread safe, so explicit locking would be required, whereas the queue can be thread safe internally, requiring minimal locking from its clients.
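
A small sketch of the producer-consumer case using Python's built-in thread-safe queue; the sentinel value is a convention of this example:

    import threading, queue

    q = queue.Queue()       # thread safe: clients do no locking of their own

    def producer():
        for item in range(5):
            q.put(item)
        q.put(None)         # sentinel telling the consumer to stop

    def consumer():
        while (item := q.get()) is not None:
            print("consumed", item)

    threading.Thread(target=producer).start()
    threading.Thread(target=consumer).start()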

Compare-and-swap style operations would also work well in a distributed system, as sketched below.
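
Here is the compare-and-swap retry pattern; the store is a local stand-in, and compare_and_set is an assumed primitive of the distributed shared memory, not a real API:

    import threading

    class Store:
        """Local stand-in for the distributed shared memory."""
        def __init__(self):
            self._data = {}
            self._lock = threading.Lock()   # models the store's atomicity

        def get(self, key):
            return self._data.get(key, 0)

        def compare_and_set(self, key, expected, new):
            # Atomically replace the value only if it still equals `expected`.
            with self._lock:
                if self._data.get(key, 0) != expected:
                    return False
                self._data[key] = new
                return True

    def increment(store, key):
        while True:                         # retry if another thread raced us
            old = store.get(key)
            if store.compare_and_set(key, old, old + 1):
                return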

Traditional locks may still be required at some point, or may simply be desired by application programmers, so they should be implemented. To keep them fault tolerant, the locks should have a TTL (time to live): if the TTL expires, the lock is invalidated, allowing the rest of the program to continue. The thread holding the lock could renew the TTL to acquire more time with it. This behaves much like the heartbeat messages used for fault tolerance.
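
A sketch of such a TTL lock on one node; the expiry bookkeeping and clock are assumptions, and a real version would keep them in the distributed shared memory rather than on a single machine:

    import time

    class TTLLock:
        def __init__(self, ttl_s):
            self.ttl_s = ttl_s
            self.holder = None
            self.expires_at = 0.0

        def acquire(self, owner):
            now = time.monotonic()
            if self.holder is None or now >= self.expires_at:
                # Free, or the previous holder's TTL lapsed (it crashed or
                # stalled), so the lock invalidates and can be retaken.
                self.holder = owner
                self.expires_at = now + self.ttl_s
                return True
            return False

        def renew(self, owner):
            # The holder's 'heartbeat': buy more time with the lock.
            if self.holder == owner and time.monotonic() < self.expires_at:
                self.expires_at = time.monotonic() + self.ttl_s
                return True
            return False

        def release(self, owner):
            if self.holder == owner:
                self.holder = None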

Heterogeneous competition
Because context saves are continually replicated through the distributed system, two copies of the same thread could be started from the same context save on different nodes. One may finish faster; the results from that node could be taken and the slower node could abandon its progress, as it would lead to the same result. This is useful when the cluster is waiting on one thread that happens to be running on a slower computer. This optimization was shown in the Google MapReduce paper [GMR] to greatly improve the overall performance of the cluster.
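
A sketch of this competition on one machine, with a thread pool standing in for two nodes starting from the same context save:

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_from_save(save):
        return sum(save["data"])    # deterministic: both copies agree

    save = {"data": list(range(1_000_000))}   # the shared context save
    with ThreadPoolExecutor(max_workers=2) as pool:
        copies = [pool.submit(run_from_save, save) for _ in range(2)]
        done, pending = wait(copies, return_when=FIRST_COMPLETED)
        result = done.pop().result()          # take the faster copy's answer
        for f in pending:
            f.cancel()                        # ask the slower copy to abandon
    print(result)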

Embarrassingly parallel
The distribution of embarrassingly parallel tasks has been done well by Google MapReduce [GMR] and the Apache Hadoop project [AHD]. Similar map-reduce functionality could easily be implemented within this cluster using these contexts: one thread reads the input and starts worker threads, which send their results to reducer threads.
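
A toy version of that shape using standard thread and queue primitives, with word counting as the embarrassingly parallel job:

    import threading, queue
    from collections import Counter

    lines = ["a b a", "b c", "a c c"]
    work, results = queue.Queue(), queue.Queue()

    def worker():
        while (line := work.get()) is not None:
            results.put(Counter(line.split()))   # the 'map' step

    def reducer(n_parts):
        total = Counter()
        for _ in range(n_parts):
            total += results.get()               # the 'reduce' step
        print(total)

    workers = [threading.Thread(target=worker) for _ in range(2)]
    for w in workers:
        w.start()
    for line in lines:
        work.put(line)          # the reader partitions the input
    for _ in workers:
        work.put(None)          # sentinels stop the workers
    reducer(len(lines))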

Resource management

File System
The distribution of the file system can be achieved by placing only part of the file system, called a brick, on each node. For redundancy, duplicates of each brick would be replicated on different computer nodes.

The client using the file system would query which node holds a particular block, then fetch that block from the respective node. This would all be packaged in the file system driver and would appear like a normal file system to the applications.

The file system needs to support growing, i.e. having more storage added to it. This can be achieved with a dynamic file-mapping structure, which would itself need to be distributed and redundant across the cluster.

Higher throughput could be achieved by striping the data across multiple servers. This is done by partitioning the data into blocks, where successive blocks go to different cluster nodes. This provides speed because while one block is physically being written to or read from one disk, the next block can start being written to or read from the next node. Other environment-aware optimizations could also be implemented, such as file locality.
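
A sketch of combined striping and replication, mapping block indices to nodes; the round-robin layout and the replication factor of two are assumptions:

    def block_nodes(block_index, nodes, replicas=2):
        # Successive blocks start on successive nodes, and each block also
        # gets replicas on the following nodes around the ring.
        start = block_index % len(nodes)
        return [nodes[(start + r) % len(nodes)] for r in range(replicas)]

    nodes = ["node-a", "node-b", "node-c", "node-d"]
    for block in range(6):
        print(block, block_nodes(block, nodes))
    # Block 0 -> a,b; block 1 -> b,c; ... successive blocks hit different
    # disks, so their physical reads and writes overlap.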

Shared memory
The system should store its context saves in memory, broadcast to other nodes for redundancy. Not all nodes need to store the contexts of all other threads, just enough for redundancy. This could require a lot of memory, but the threads can choose when they have reached a state worth saving, ensuring that they only save what's required. To ensure context saves don't get applied out of order, each will carry a timestamp. So if two computers compute the same context state but one is further through its computation than the other, each save records how far through it is, and a node receiving both will know which one to keep: the one with the higher timestamp.
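
The keep-the-newer-save rule is small enough to sketch directly; the store and the timestamps here are illustrative:

    saves = {}   # thread id -> (timestamp, state)

    def receive_save(thread_id, timestamp, state):
        current = saves.get(thread_id)
        if current is None or timestamp > current[0]:
            saves[thread_id] = (timestamp, state)   # newer: keep it
        # else: a stale or duplicate save, discard it

    receive_save("t1", 5, {"progress": 5})
    receive_save("t1", 3, {"progress": 3})   # arrives late; ignored
    print(saves["t1"])                       # (5, {'progress': 5})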

The system will need a similar communication and shared memory structure for coordinating the execution of thread contexts. Care will need to be taken to ensure race conditions don't occur. The scheduling and starting of threads could be done in two phases: in the first phase the thread to start is scheduled, the schedule is communicated to all nodes, and any race conditions are dealt with; the second phase takes place once the schedule is agreed upon, and the thread is actually started.
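
A simplified sketch of the two phases, reusing the hypothetical compare-and-set store from the locking section: phase one claims the thread atomically so racing nodes back off, and phase two starts the thread only once the claim has won:

    def try_start(store, thread_id, node, start_thread):
        # Phase 1: schedule. Atomically claim the slot; a losing node sees
        # the claim fail and backs off, so the race is resolved.
        if not store.compare_and_set(("claim", thread_id), 0, node):
            return False          # another node already claimed this thread
        # Phase 2: the schedule is agreed, so actually start the thread.
        start_thread(thread_id)
        return True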


How close we are to building one

Some of the components already exist, while others need work. Work needs to be done to package everything together for easy distribution.

Distributed file systems already exist: there is the Google File System [GFS] and the Gluster file system [GLS].

Redundant coordination already exists in some computer games. With multiple computers playing one network game, one computer is chosen as the server; if that server goes offline, the other computers elect a new server. This is similar to the coordination required for a distributed operating system.

Apache ZooKeeper also provides redundant distributed coordination software [AZK].

We have distributed hash tables and distributed databases like Apache Cassandra [ACD], which work like a distributed shared memory. However, they are usually built for persistent storage, not the fast, volatile shared memory required for task coordination.

Redundant distributable task contexts have been done well for embarrassingly parallel tasks, with MapReduce and Hadoop. Building on my proposed model should work well, but more work is needed to show that it's feasible: the context saving needs to be shown not to be too expensive, and then it needs to be implemented.

So how close are we? A year with a solid development team should be enough to put something together, building on an existing distributed file system and database. The big problem will be making it popular enough to gain momentum. Many of the companies that could provide the initial push of development seem to work in a more embarrassingly parallel world, so they wouldn't benefit from this.

References
[GMR] Jeffrey Dean & Sanjay Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. Google, Inc.
[AHD] http://hadoop.apache.org/
[GFS] Sanjay Ghemawat, Howard Gobioff & Shun-Tak Leung (2003). The Google File System. Google, Inc.
[GLS] http://www.gluster.org/
[AZK] http://zookeeper.apache.org/
[ACD] http://cassandra.apache.org/