Scalable component-based OS construction


A research OS focusing on low-latency predictability, security, and reliability that scales from resource-constrained microcontrollers up to massively parallel systems.


OS research for a changing world


OSes have difficulty both scaling up to very large-scale systems and scaling down to the smallest of IoT devices. This is especially true when they must provide strong non-functional properties such as security, reliability, predictability, and efficiency. Composite provides abstractions and mechanisms tailored toward a world in which tail latency, system resiliency, and dependability are increasingly important.


Key principles:

  • Component-based policy. System policies for resource management should be defined in user-level components. There are two key implications of this principle: 1. similar to the microkernel principle of minimality, often simplified as "policies that can be implemented at user-level should be"; 2. resources should be managed by separate components with orthogonal implementations, where possible. This principle is the foundation for the component-based design of the overall system.

  • End-to-end execution attribution. It is difficult to understand who is using each resource at every point in time, as a "client" can make many transitive requests through many components (for example, communicating with a socket component, then TCP, then scheduling). To provide full accountability and resource management (e.g., priority management of shared resources), end-to-end tracking of principals is required.

  • Principle of least privilege. This principle is a foundation for secure systems. Each body of code should have access to the resources needed to carry out its goals, and only those resources. This implies the fine-grained decomposition of system software into separate functional components, each with limited access to system resources. Composite takes this principle to an extreme by separating even the lowest-level traditional policies (including scheduling, synchronization, and memory management) into separate components.

  • Scalable predictability. Execution is predictable when we can determine and provide a bound on its latency. This is an essential non-functional requirement of any real-time system, but is increasingly important as massive servers concern themselves with lowering service tail latencies. Composite is designed to provide predictable execution even with an increasing number of cores by avoiding locks and shared, modified cache-lines.

Composite Differentiating Technologies


Lock-less Kernel

The kernel has no locks, and avoids the contention that threatens scalability to very large numbers of cores. Composite uses quiescence techniques, along with wait-free algorithms to mediate contention, and namespace partitioning to avoid it where possible.
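
As a concrete illustration of the namespace-partitioning idea (a minimal sketch in C, not Composite's actual code), consider giving each core its own slice of an identifier namespace: allocation then touches only core-local, cache-line-aligned state, requires no locks or even atomic instructions, and is trivially wait-free.

```c
#include <stdint.h>

#define NCORES      64
#define IDS_TOTAL   (1 << 20)
#define IDS_PERCORE (IDS_TOTAL / NCORES)

/* Per-core allocation state, padded and aligned to a cache line so
 * that cores never write to a shared, modified line. */
struct percore_ids {
	uint32_t next; /* next free offset within this core's slice */
	char     pad[64 - sizeof(uint32_t)];
} __attribute__((aligned(64)));

static struct percore_ids ids[NCORES];

/* Allocate an id on `core`. Each core exclusively owns the range
 * [core * IDS_PERCORE, (core + 1) * IDS_PERCORE), so no
 * synchronization is needed; returns -1 when the slice is full. */
int64_t
id_alloc(unsigned core)
{
	struct percore_ids *c = &ids[core];

	if (c->next >= IDS_PERCORE) return -1;
	return (int64_t)core * IDS_PERCORE + c->next++;
}
```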

Capability-based Security

Capability-based access control pervasively mediates access to fine-grained system resources, including not only memory and kernel abstractions such as threads, but also computation time itself. Capability propagation is governed by user-level policies, simplifying kernel design.
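
The core check can be sketched as follows (hypothetical types and layout; Composite's actual capability structures differ): every operation names a kernel resource only through an index into the invoking component's capability table, and possession of a suitably authorized capability is the sole requirement for access.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical resource types and access rights. */
typedef enum { CAP_FREE, CAP_THD, CAP_PGTBL, CAP_TCAP } cap_type_t;

#define CAP_READ  0x1
#define CAP_WRITE 0x2
#define CAP_CALL  0x4

struct capability {
	cap_type_t type;   /* kind of kernel object this names      */
	uint32_t   rights; /* operations this capability authorizes */
	void      *obj;    /* the kernel object itself              */
};

#define CAPTBL_SZ 256

/* Resolve a capability id in a component's table, checking both the
 * expected resource type and the rights the operation requires. */
void *
cap_lookup(struct capability *captbl, unsigned capid,
           cap_type_t type, uint32_t required)
{
	struct capability *c;

	if (capid >= CAPTBL_SZ) return NULL;
	c = &captbl[capid];
	if (c->type != type || (c->rights & required) != required) return NULL;

	return c->obj;
}
```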

User-level Scheduling

The kernel contains no scheduling policy, instead defining system-wide scheduling in isolated user-level components, thus enabling heightened configurability, isolation, and performance.
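
A minimal sketch of what this looks like (the `sys_thd_dispatch` kernel call and all structures here are hypothetical): the kernel exposes only a mechanism to switch to a thread named by a capability, while the policy -- here, a toy fixed-priority run queue -- is ordinary user-level component code that can be replaced without touching the kernel.

```c
#include <stddef.h>

typedef unsigned int thdcap_t;

/* Hypothetical kernel interface: the only scheduling-related
 * operation the kernel provides is dispatching a thread capability. */
extern int sys_thd_dispatch(thdcap_t thd);

struct thd {
	thdcap_t    cap;
	int         prio;     /* lower value = higher priority */
	int         runnable;
	struct thd *next;
};

static struct thd *runqueue;

/* Toy fixed-priority policy, entirely at user-level: find the
 * highest-priority runnable thread and ask the kernel to run it.
 * Swapping in EDF or round-robin changes only this component. */
void
schedule(void)
{
	struct thd *t, *best = NULL;

	for (t = runqueue; t != NULL; t = t->next) {
		if (!t->runnable) continue;
		if (best == NULL || t->prio < best->prio) best = t;
	}
	if (best != NULL) sys_thd_dispatch(best->cap);
}
```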

Thread-Migration-based IPC

Inter-process communication performance is the cornerstone of microkernel design. To enable high performance, fine-grained decomposition of the system into components, and control over end-to-end thread latencies, Composite uses a highly optimized form of thread migration.
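
The semantics can be illustrated with a small user-level simulation (hypothetical names; the real mechanism lives in the kernel and also switches page tables): instead of blocking a client thread and waking a server thread, the same schedulable thread "migrates" into the server's protection domain and back, so its priority and accounting follow the request end-to-end.

```c
#include <stdio.h>

#define MAX_NEST 16

struct invstk_frame { int pd; }; /* protection domain to return to */

struct thread {
	int                 pd;   /* current protection domain     */
	int                 prio; /* follows the thread everywhere */
	struct invstk_frame invstk[MAX_NEST];
	int                 invstk_top;
};

/* Invoke a function in protection domain srv_pd. A real kernel would
 * also switch page tables here; the schedulable thread (and thus its
 * priority and accounting) is unchanged throughout. */
long
sinv_call(struct thread *t, int srv_pd, long (*srv_fn)(long), long arg)
{
	long ret;

	t->invstk[t->invstk_top++].pd = t->pd; /* save return domain      */
	t->pd = srv_pd;
	ret   = srv_fn(arg);                   /* runs on the same thread */
	t->pd = t->invstk[--t->invstk_top].pd; /* "return": migrate back  */

	return ret;
}

static long server_fn(long x) { return x + 1; }

int
main(void)
{
	struct thread t = { .pd = 1, .prio = 5, .invstk_top = 0 };

	printf("ret=%ld pd=%d\n", sinv_call(&t, 2, server_fn, 41), t.pd);
	return 0;
}
```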


Research Contributions

The unique technologies in Composite have been leveraged in a number of research domains. A sampling of our high-level contributions includes:

  • Efficient coordination and scheduling. The first reaction to a design that requires the user-level definition of scheduling policy and emphasizes inter-protection-domain communication is that it must be slow. Each activation of the scheduler and each instance of communication requires mode switches and page-table switches (see the fundamentals involved). Though unintuitive, these hardware overheads can be less than the software overheads that conventional (even highly optimized) systems impose. Correspondingly, thread switch operations and communication primitives take a small fraction of a microsecond in Composite, and become insignificant sources of overhead for non-trivial applications and system services.

  • Microcontroller/IoT virtualization. Microcontrollers with 16-256KB of SRAM typically do not have the memory management unit (MMU) facilities of more complex processors. Instead, they provide memory protection unit (MPU)-based protection that is much more limited and doesn't virtualize addresses (i.e., no virtual-to-physical translation). To enable secure and reliable IoT device execution, Composite provides an efficient abstraction across both MMUs and MPUs to provide memory protection (see the first sketch after this list). Further, it provides a virtualization infrastructure over it that enables isolation across CPU, memory, and I/O dimensions. This strong isolation is required for the next generation of low-power, inexpensive IoT and cyber-physical devices. For more information, see our paper.

  • Fault tolerance for system-level services. Building reliable systems is quite difficult. Fault tolerance techniques acknowledge that faults will happen -- whether due to cosmic rays changing architectural state (Single Event Upsets, or SEUs), or intermittent software bugs (races) -- and that techniques can be put into place to recover from those faults. Of note, these techniques are required for high-confidence systems even if the software is completely verified, due to the hardware-level errors involved. This is more difficult in real-time systems in which a physical system is being controlled. Control loops often need to run at 100 or 1000 Hz, and the recovery operation itself takes a significant amount of time. Checkpoint and restore in existing infrastructure takes hundreds of milliseconds (and a couple of orders of magnitude more in Xen), which cannot easily be integrated into real-time control. Even a highly optimized version of checkpoint/restore (that executes at memcpy speeds) implemented in Composite cannot be integrated into real-time systems with moderately tight bounds. The Computational Crash Cart (C3) is instead based on the ability to reboot not the entire machine, but individual components in 10s of microseconds, and pairs that mechanism with a system that tracks communication between components as state machines to recreate the state of a failed and rebooted component (see the tracking sketch after this list). We've expanded this work by creating a language for generating this tracking code, thus providing reliability without explicitly writing recovery code; and Composite's tracking infrastructure was expanded to provide detection of latent faults, and to perform detection of statistical anomalies for real-time intrusion detection and recovery.

  • Predictable execution with inter-component communication. It is more difficult than it seems to provide predictable execution while handling the dependencies created when communicating between components. Thread migration-based communication links the execution of inter-protection-domain requests to a single schedulable entity, thus allowing the end-to-end tracking and accounting of that execution. This is essential because as the number of protection domains increases, communication (especially nested communication) becomes more prevalent (see the principle of least privilege). Thread migration, paired with the push of resource management policies into user-level, required investigations into stack management, and into shared memory-based message passing.

  • Access control for time. Often the security relationships between different subsystems require that they use asynchronous communication to coordinate. This is similar to how VMs interact with Dom0 in Xen. We pioneered Temporal Capabilities (TCaps), which enable different schedulers to functionally coordinate while controlling the interference that other subsystems can cause them. For example, when a "client" subsystem (e.g. a VM) requests services from a "server" subsystem (e.g. Dom0), how can the server properly attribute this execution to the client? How can the overall system scheduler properly schedule the server's computation while servicing the request so as to not unduly interfere with other VMs? How can the client VM prevent that server execution from interfering with even higher-priority computation within itself? TCaps solve these problems of inter-scheduler coordination by enabling proper execution accounting and prioritization between schedulers, allowing server computation to be done using client time, and allowing the client and broader system to control the interference of any given request on other computation. They are a generally useful mechanism, but are especially important within Composite, where many different subsystems, each with their own schedulers, must coordinate (see the TCap sketch after this list).

  • Scalable, predictable parallelism. The Composite kernel includes no locks to protect shared data-structures. Locks impose significant cache-coherency overhead, and can cause rare, large overheads that harm predictability and tail latencies. Moreover, they limit the scalability of the system with an increasing number of cores and sockets. In contrast, Composite's data-structures are designed to be simple and amenable to wait-free implementations. This design leads to a kernel in which we can guarantee very low amounts of contention, thus scalability even in the worst case -- a guarantee necessary for real-time and low-latency systems. We've provided this guarantee for our kernel, and for preemptive (user-level) execution. The latter shows more than a factor-of-ten improvement in memcached performance and significant improvements in tail latency.

  • Efficient, predictable parallel programming. OpenMP is a common parallel programming framework. FJOS is an OS focused on providing efficient fork-join-based parallelism in OpenMP (see the fork-join example after this list). It aims to be as lightweight an abstraction as possible around the core Composite operations and mechanisms. As such, it achieves up to more than a factor-of-five decrease in overheads, with even larger gains in the worst case. The worst-case overhead of a blocking implementation in FJOS is on par with that of the spinning implementation in Linux. This enables the use of OpenMP in parallel real-time systems to push the bounds on what can be computed predictably.

  • Secure online voting system. Composite was used in the verifiable election based on Scantegrity for Takoma Park, MD. It provided a secure webpage for verifying ballots after the election. The system is based on the premise that the policies, abstractions, and mechanisms of the system, if tailored for the very specific task of providing a security bulletin board (BB) system, can be streamlined to produce a very specialized, secure, and reliable system. This is particularly important for verifiable voting systems that make the strong assumption that the BB is secure.
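
First, for the microcontroller/IoT virtualization work above, a hedged sketch of a protection abstraction spanning MMUs and MPUs (hypothetical structures, not Composite's actual interface): regions are restricted to sizes and alignments that either backend can express, and since an MPU performs no translation, an address means the same thing under both.

```c
#include <stdint.h>

typedef enum { PROT_R = 1, PROT_W = 2, PROT_X = 4 } prot_t;

/* A memory region that both backends can enforce: an MMU backend
 * installs page-table entries for [addr, addr + size); an MPU
 * backend programs a single region register. No translation exists
 * on an MPU, so addr is used both virtually and physically. */
struct mem_region {
	uintptr_t addr; /* base address, aligned to size */
	uintptr_t size; /* power of two                  */
	prot_t    prot;
};

/* Check that a region is expressible on both MMU and MPU backends. */
int
region_valid(const struct mem_region *r)
{
	if (r->size == 0 || (r->size & (r->size - 1)) != 0) return 0; /* pow2?    */
	if (r->addr & (r->size - 1))                        return 0; /* aligned? */
	return 1;
}
```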
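
Second, a hedged sketch of the C3 tracking idea for a hypothetical lock component (names and protocol invented for illustration): a proxy records each client's position in the component's interface state machine, so that after a microsecond-scale component reboot, the lost state can be recreated by replaying the recorded transitions rather than restoring a checkpoint.

```c
/* Illustrative interface state machine for a lock component. */
enum lock_state { LOCK_FREE, LOCK_TAKEN };

struct tracked_lock {
	int             owner; /* client currently holding the lock */
	enum lock_state state;
};

#define NLOCKS 128
static struct tracked_lock tracked[NLOCKS];

/* Provided by the (restartable) lock component; hypothetical API. */
extern void lock_take(int lockid, int client);

/* Called by the tracking proxy on each successful take/release, so
 * the proxy always mirrors the component's protocol state. */
void
track_take(int lockid, int client)
{
	tracked[lockid].owner = client;
	tracked[lockid].state = LOCK_TAKEN;
}

void
track_release(int lockid)
{
	tracked[lockid].state = LOCK_FREE;
}

/* After a micro-reboot of the lock component, rebuild its state by
 * replaying every LOCK_TAKEN transition into the fresh instance. */
void
lock_component_rebuild(void)
{
	for (int i = 0; i < NLOCKS; i++) {
		if (tracked[i].state == LOCK_TAKEN)
			lock_take(i, tracked[i].owner);
	}
}
```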
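
Third, a hedged sketch of TCap delegation (a simplification; the published mechanism is more involved): a temporal capability couples a time budget with the priority each scheduler along a delegation chain assigned it, and delegation can only split the budget and append a priority, so a client bounds the interference that server execution on its donated time can cause.

```c
#include <stdint.h>

#define MAX_SCHEDS 8

/* Hypothetical TCap: a time budget plus the priority assigned by
 * each scheduler the capability has been delegated through. */
struct tcap {
	uint64_t budget_cycles;    /* remaining execution budget */
	uint32_t prio[MAX_SCHEDS]; /* priority at each scheduler */
	unsigned nscheds;
};

/* Delegate `cycles` of budget from src into dst, binding the
 * receiving scheduler's priority. The budget can only shrink and
 * priorities only accumulate across delegations. */
int
tcap_delegate(struct tcap *dst, struct tcap *src,
              uint64_t cycles, uint32_t prio_at_receiver)
{
	if (cycles > src->budget_cycles || src->nscheds >= MAX_SCHEDS) return -1;

	src->budget_cycles        -= cycles;
	*dst                       = *src;
	dst->budget_cycles         = cycles;
	dst->prio[dst->nscheds++]  = prio_at_receiver;

	return 0;
}
```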
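
Finally, for context on the FJOS work, a standard OpenMP fork-join region (plain OpenMP, not an FJOS-specific API): the parallel pragma forks a team of threads and the implicit barrier at the region's end joins them; these fork, join, and barrier operations are exactly the overheads FJOS implements efficiently and predictably over Composite's primitives.

```c
#include <stdio.h>
#include <omp.h>

#define N 1024

int
main(void)
{
	static double a[N], b[N], dot[N];
	double sum = 0.0;

	for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

	/* Fork: a team of threads splits the iterations across cores.
	 * The implicit barrier at the end of the loop is the join. */
	#pragma omp parallel for reduction(+ : sum)
	for (int i = 0; i < N; i++) {
		dot[i] = a[i] * b[i];
		sum   += dot[i];
	}

	printf("dot product = %f (up to %d threads)\n",
	       sum, omp_get_max_threads());
	return 0;
}
```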

Who are we?

Most researchers working on Composite are at The George Washington University and part of Gabe Parmer's lab. If you're interested in doing a PhD focused on system design and implementation, please send Gabe an email and note that your interest came from this webpage.

Composite is open source!

Check it out on github!

Want to learn more?


We have an internal slack where most communication takes place. To kick off communication, you're best off emailing Gabe. More information can be found in our publications and in some of the background posts.

If you're interested in discussing the application of Composite in interesting domains, or are in a company interested in its goals, don't hesitate to email Gabe.

Sponsors


We'd like to sincerely thank our sponsors. The Composite Component-Based OS development effort has been supported by grants from the National Science Foundation (NSF) under awards CNS 1137973, CNS 1149675, and CNS 1117243; by the Office of Naval Research (ONR) under award N00014-14-1-0386 and ONR STTR awards N00014-15-P-1182 and N68335-17-C-0153; and by Hewlett Packard Enterprise (HPE), ARM, SRC, CRA, and Lincoln Labs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

© Gabriel Parmer, 2013, all rights reserved, background image designed by Vector Open Stock