MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development.
OpenRefine is a power tool that allows you to load data, understand it, clean it up, reconcile it against a master database, and augment it with data coming from Freebase or other web sources, all in the comfort and privacy of your own computer.
RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads.
RocksDB builds on LevelDB to scale on servers with many CPU cores, to use fast storage efficiently, to support IO-bound, in-memory and write-once workloads, and to be flexible enough to allow for innovation.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
InfluxDB is a time series, events, and metrics database. It’s written in Go and has no external dependencies. That means once you install it there’s nothing else to manage (like Redis, HBase, or whatever). It’s designed to be distributed and scale horizontally, but to be useful even if you’re only running it on a single box.
spray is a suite of lightweight Scala libraries providing client- and server-side REST/HTTP support on top of Akka.
We believe that, having chosen Scala (and possibly Akka) as primary tools for building software, you’ll want to rely on their power not only in your application layer but throughout the full (JVM-level) network stack. spray provides just that: a set of integrated components for all your REST/HTTP needs that let you work with idiomatic Scala (and Akka) APIs at the stack level of your choice, all implemented without any wrapping layers around “legacy” Java libraries.
spray’s development is guided by the following principles:
Fully asynchronous, non-blocking
All APIs are fully asynchronous; blocking code is avoided wherever possible.
Actor- and Future-based
spray fully embraces the programming model of the platform it is built upon. Akka Actors and Futures are key constructs of its APIs.
spray’s low-level components in particular are carefully crafted for excellent performance in high-load environments.
All dependencies are very carefully managed, and spray’s codebase itself is kept as lean as possible.
Because spray is structured into a set of integrated but loosely coupled components, your application only needs to depend on the parts that are actually used.
All spray components are structured in a way that allows for easy and convenient testing.
Bloomd is a high-performance C server which is used to expose Bloom filters and operations over them to networked clients. It uses a simple ASCII protocol which is human-readable and similar to memcached’s.
What is a Bloom filter?
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
A Bloom filter consists of two components: a set of k hash functions and a bit vector of a given length. We choose the length of the bit vector and the number of hash functions depending on how many keys we want to add to the set and how high an error rate we are willing to put up with — more on that a little bit further on.
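As a rough illustration of how the two parameters can be chosen, the sketch below uses the standard textbook approximations for an optimally sized Bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n expected keys and a target false-positive rate p. These formulas and the function name `bloom_parameters` are not taken from the text above; they are assumptions added for illustration.

```python
import math


def bloom_parameters(expected_keys, false_positive_rate):
    """Return (num_bits, num_hashes) for a Bloom filter.

    Uses the usual approximations for an optimally sized filter;
    the name and signature are hypothetical, for illustration only.
    """
    if expected_keys <= 0:
        raise ValueError("expected_keys must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")

    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    num_bits = math.ceil(
        -expected_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, rounded to the nearest integer, at least 1.
    num_hashes = max(1, round(num_bits / expected_keys * math.log(2)))
    return num_bits, num_hashes
```

For example, `bloom_parameters(1000, 0.01)` gives a vector of 9586 bits and 7 hash functions, which is a little under 10 bits per key.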
All of the hash functions in a Bloom filter are configured so that their range matches the length of the bit vector. For example, if a vector is 200 bits long, the hash functions return a value between 1 and 200. It’s important to use high-quality hash functions in the filter to guarantee that output is equally distributed over all possible values — “hot spots” in a hash function would increase our false-positive rate.
To enter a key into a Bloom filter, we run it through each one of the k hash functions and treat the result as an offset into the bit vector, turning on whatever bit we find at that position. If the bit is already set, we leave it on. There’s no mechanism for turning bits off in a Bloom filter.
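The insertion procedure described above can be sketched in a few lines of Python. This is a minimal toy, not Bloomd's implementation: the class name `BloomFilter`, the use of salted SHA-256 to stand in for k independent hash functions, and the one-byte-per-bit vector are all assumptions chosen for readability. Offsets are 0-based here (0 to 199 for a 200-bit vector) rather than 1 to 200. The `might_contain` method is the usual membership check, included so the effect of `add` can be observed.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit vector plus k salted hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter
        # would pack eight bits into each byte.
        self.bits = bytearray(num_bits)

    def _offsets(self, key):
        # Simulate k hash functions by salting SHA-256 with the index i.
        # Reducing modulo num_bits makes each function's range match
        # the length of the bit vector.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        # Turn on the bit at every offset; bits that are already set
        # stay set, and nothing ever turns a bit off.
        for offset in self._offsets(key):
            self.bits[offset] = 1

    def might_contain(self, key):
        # True if every bit for this key is set. This can be a false
        # positive, but a key that was added is never reported absent.
        return all(self.bits[offset] for offset in self._offsets(key))
```

With `bf = BloomFilter(200, 3)`, calling `bf.add("apple")` sets at most three bits, after which `bf.might_contain("apple")` is true. Adding the same key again changes nothing, since the bits are already on.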
Packer is an open source tool for creating identical machine images for multiple platforms from a single source configuration. Packer is lightweight, runs on every major operating system, and is highly performant, creating machine images for multiple platforms in parallel. Packer does not replace configuration management like Chef or Puppet. In fact, when building images, Packer is able to use tools like Chef or Puppet to install software onto the image.
A machine image is a single static unit that contains a pre-configured operating system and installed software which is used to quickly create new running machines. Machine image formats change for each platform. Some examples include AMIs for EC2, VMDK/VMX files for VMware, OVF exports for VirtualBox, etc.
Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.
OOS is an object-relational mapping (ORM) framework written in C++. It aims to encapsulate all the database backend details, so you don’t have to deal with database backends, SQL statements, data type mapping, or object serialization.
It provides an easy-to-use API and, as a unique feature, comes with a single container for all objects: the object store. This container gives you a centralized point of storage for all objects, with the ability to create views on concrete object types and to link or filter them.