Network Attached Storage :: To provide a NAS device which caters for the most common data storage needs (distributed/scalable, replicated/available and striped/performant storage models), within a native storage architecture that will enable multi-box, multi-rack, multi-site configurations.
Project output includes example hardware component information with assembly instructions, Linux firmware, and operating documentation.
Project Saturn was started in 2006 to construct a robust Network Attached Storage capability: to provide a NAS device which caters for the most common data storage needs (distributed/scalable, replicated/available and striped/performant storage models), within a native storage architecture that will enable multi-box, multi-rack, multi-site configurations. Generation one ("Pan") was built from GlusterFS on x86/x64; the most recent incarnation ("Daphnis") has been built from MooseFS on the Odroid HC2.
Specifically, Saturn aims to be/have:
Must be capable of concatenating multiple drives on either a single node or multiple nodes into a single contiguous storage capacity
Must be expansible such that adding another drive to an existing storage capacity must not require recreation of that storage capacity (destroying, creating, formatting)
Must be capable of replicating data between drives on either a single node or multiple nodes into a single highly available storage capacity
Must be capable of multiple replications to support the "two copies in the same rack, one copy in another" principle, and to address the Bathtub curve penalty seen with bulk drive purchases, i.e. correlated multi-drive/node failures
Should support high-latency asynchronous replication to enable multi-site deployments
No dependence on strict RAID implementations
Must not employ proprietary on-disk formats - Hardware RAID tends to employ proprietary on-disk formats, and thus the volume, or indeed any individual disk, becomes unrecoverable when the RAID set is broken
Must enable good throughput on commodity equipment - Software RAID tends to bring about poor performance without accelerated hardware features
Recovery of replicated data should not be location specific
Should be able to recover a failed node at one site either at its home site or at any other site with that data (i.e. being able to transport an empty or out-of-sync node to another location for high speed data recovery is highly desirable)
No need for manual maintenance
All maintenance tasks should be exceptional and ideally limited to Moves, Adds and Changes
All sub-components that require routine maintenance should be automated wherever practicable
No lost data
Data degradation (aka Bit Rot) will be automatically detected and corrected
When configuring the NAS it should not be possible to lose or destroy data without asserting a conscious administrative decision to do so
Locally (LAN) presented storage capacity should be accessible via native client interfaces (like SMB or NFS), though client-side agents are acceptable
Remotely (WAN) presented storage capacity should be accessible via API-friendly interfaces (like REST based object storage)
All stored data (including metadata) will be encrypted at rest, to mitigate physical compromise (theft)
All data in transit should be encrypted, to mitigate local network compromise (MitM)
Private mount points from the common storage pool must be authenticated
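The "no lost data" requirement above implies periodic scrubbing: reading stored data back, recomputing checksums, and comparing them against recorded digests (MooseFS performs this natively for its chunks). A minimal illustrative sketch of the idea, with a hypothetical manifest mapping relative paths to expected SHA-256 digests:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def scrub(manifest: dict[str, str], root: Path) -> list[str]:
    """Return relative paths whose on-disk content no longer matches
    the recorded digest (candidates for repair from a replica)."""
    corrupted = []
    for rel_path, expected in manifest.items():
        if sha256_of(root / rel_path) != expected:
            corrupted.append(rel_path)
    return corrupted
```

In a replicated deployment, a mismatch triggers a re-fetch from a healthy replica rather than an error to the user, which is what makes silent Bit Rot survivable.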
The build guide (the Saturn Installation and Operations Manual, included below) describes a highly available, scale-out storage fabric that is built on the Odroid HC2 with 16TB drives (Seagate ST16000VE000), employs full-disk encryption (covering both the MooseFS chunks and the MooseFS metadata) with hardware security token authentication via YubiKey, and is entirely passively cooled. In addition to a step-by-step build process for both the hardware and software stages, the document includes guidance for operations (including details on security, performance and thermal management). Further, for those interested, it also includes a complete Bill of Materials and Cost Benchmarking against the market, for solution context as at the time of the original release (June/July 2020).
To achieve the project's availability goal, the build guide takes the MooseFS Storage Classes Manual example of "Scenario 1: Two server rooms (A and B)" and adapts it to a deployment across two shelves, to assure predictable availability within a site. By applying labels to the MooseFS chunk servers and defining a new storage class, the logical availability grouping within MooseFS matches the physical availability grouping within the Data Center. Because data is replicated to at least one "A" node on one shelf and one "B" node on another, and each shelf has its own power and network infrastructure, an entire shelf can be lost or disabled without losing any content. This architecture can be scaled to an additional rack with the creation of "C" nodes and a slight variation of the storage class.
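The shelf-aware replication described above maps to MooseFS configuration roughly as follows. This is a hedged sketch based on the Storage Classes Manual's Scenario 1; the label names, mount point, class name and data path are illustrative, not taken from the build guide:

```shell
# On each chunk server on shelf A (/etc/mfs/mfschunkserver.cfg):
LABELS = A
# On each chunk server on shelf B:
LABELS = B

# On a client with the MooseFS filesystem mounted at /mnt/mfs, create a
# storage class that keeps one copy on an A node and one on a B node:
mfsscadmin /mnt/mfs create A,B shelf_ab

# Apply the class recursively so content replicates across both shelves:
mfssetsclass -r shelf_ab /mnt/mfs/data
```

With this in place the master will only consider a chunk fully replicated once copies exist on both labels, which is what allows a whole shelf to be powered off without data loss.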
Also included in the documentation (below) is the new Digital Asset Management Guide - A Taxonomy For Digital Repositories. Given that this is a storage project, the DAM Guide has been authored to provide open guidance on the management of unstructured data. Once a business is on to its third terabyte of unstructured data (or an individual starts working with more than a couple of USB drives' worth of files), the question inevitably arises: "how do I manage all of these files?" The most intuitive computer interface for unstructured data is the file-system, and the ability to file content away and retrieve it quickly does not require a search engine or even a database, so long as a logical taxonomy is articulately defined and easily understood, and naming conventions are clear and consistently followed.
The Digital Asset Management Guide starts from first principles:
Create useful containers -
It is common to direct a certain class of infrastructure and/or applications to a certain type of file; for example, digital signage will want a directory full of picture, flash, or video files. The taxonomy and naming conventions should support this type of grouping, such that applications can be associated with one or two directories and don’t need to be given the entire file system to find their content of interest.
Store handles for, and not classification of, content -
The file-system (directory and file names) should only contain sufficient information to successfully file and retrieve content in a useful way. Applications that consume certain file types (such as iTunes, for audio files) will acquire and store their own metadata, providing additional views of classification. Those views (such as Author, Genre, etc.) should not be embedded into the on-disk structure.
The absolute minimum data required at the file-system layer would therefore be the data required to "look up" the metadata (i.e. the handle). This could mean storing documents named as their invoice number, for example. However, the practical minimum data required at the file-system layer needs to include consideration of how users are to interact with the files – i.e. a single list of invoice-numbered files may not be intuitive, but a file named by issue date and invoice number inside an organisational directory might be.
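As an illustration of the "practical minimum" above, a naming scheme that stores only the handle (the invoice number) plus the issue date and issuing organisation needed for human retrieval might look like this sketch. The directory layout and field names are assumptions for illustration, not taken from the guide:

```python
from datetime import date
from pathlib import PurePosixPath

def invoice_path(org: str, issued: date, invoice_no: str) -> PurePosixPath:
    """File an invoice under its organisation, named by issue date and
    invoice number - enough to look the record up, and nothing more."""
    filename = f"{issued.isoformat()} {invoice_no}.pdf"
    return PurePosixPath("Finance") / "Invoices" / org / filename
```

For example, `invoice_path("Acme", date(2020, 6, 1), "INV-1042")` yields `Finance/Invoices/Acme/2020-06-01 INV-1042.pdf` - sortable by date within each organisation, with all richer classification (amounts, status, line items) left to the application's own metadata.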
Remove ambiguity -
The resultant taxonomy should not create opportunities for confusion (i.e. "should I file this content here, or there?").
Minimising/removing metadata from the taxonomy and naming conventions will aid usability. For example, filing a Lecture in a directory of Lectures is straightforward. Filing a Lecture in a structure of institutions that is parallel to a structure of lecture topics increases filing complexity and makes access less reliable.
Consider performance -
Although the performance of the storage sub-system is not in the scope of the content management guide, poor taxonomic structure can impact storage performance.
Broadly speaking, file-systems are designed as hash tables. If the application only ever adds/moves/deletes specific files, then the directory structure has very little impact on the performance of the application. If the application provides some sort of browsing or scanning functionality, then having millions of files in a single directory will increase load on the storage sub-system and decrease application performance, negatively impacting user experience.
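One common mitigation for the millions-of-files case is to shard a flat namespace into fixed-fanout subdirectories keyed by a hash prefix, so no single directory grows without bound. A sketch of the idea (the two-level, 256-way layout is an arbitrary illustrative choice):

```python
import hashlib
from pathlib import PurePosixPath

def sharded_path(root: str, name: str) -> PurePosixPath:
    """Spread files across root/xx/yy/ subdirectories using the first
    two bytes of the name's SHA-256, keeping each directory small."""
    prefix = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return PurePosixPath(root) / prefix[:2] / prefix[2:4] / name
```

Note that this trades away human browsability, so it suits application-managed stores (thumbnail caches, chunk stores) rather than the human-facing taxonomy the guide is concerned with.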
Leverage existing standards -
Where common structures and/or naming conventions exist, these will be leveraged.
Domains covered in the taxonomy so far include Apps, Human Resources (or Family), Devices / Platforms, Finance, Legal, Multimedia, Personal / Private and Projects. So, whether you’re using Box / Dropbox, Google Drive, iCloud, One Drive, S3, a NAS from your Data Centre or a local file server, this initial taxonomy has been developed to provide you with guidance to intuitively manage your digital assets.
The following screenshots show the software and hardware developed for this project in action:
The following documents (papers, guides, manuals, etc.) have been developed for this project: