ConceptLogic


OpenLSD stands for Open Legacy Storage Document. It is a framework for archiving documents in a secure and powerful way, designed to handle very large numbers of documents. Documents can be of any type.

The main ideas are mine but, like anyone, I could not have achieved so much work without the help of, and the great discussions with, several people, specifically Mr. Vincent Castella, the chief of « generics » (private joke).

OpenLSD, a framework from which you build archive software for your specific needs

OpenLSD is a framework, not a finished software product, even if some examples are included. Following this logic, documents can be of any type, since it is up to the business logic to handle each specific format and to know what to do with it. For instance, OpenLSM (Legacy Storage Mail) is based on OpenLSD and archives email from email client software (Thunderbird, or any other client for which an interface can be implemented). OpenLSM is responsible for handling the email format. The example shipped with OpenLSD is a simple archive application that uses one string as document identifier or index, and a web interface (JSP using Tomcat) to get or put documents.

Starting from the examples, one can implement specific archive software, spending time on the business approach rather than on reimplementing what OpenLSD already provides.

The main idea is that there are some “Impl” packages that must be adapted to fit the business needs. This construction enables fast implementation starting from the current example. One can of course modify or extend the example, or create a new implementation from it.

Multiple Clients-Server approach

OpenLSD has two parts, server and client, but cannot be reduced only to a client-server approach.

OpenLSD can be used with a strict server-client protocol, but also in a web services approach (such as the JSP examples) or in a distributed approach, with or without centralization of the documents. It suits small businesses as well as large organizations: it was intended to handle huge numbers of documents and connections, yet it was kept as simple as possible. It uses the NIO support of the MINA framework.

Server on Archival process

The server part does nothing with the business properties; it is only responsible for storing documents and retrieving them when asked. The server only knows the three indexes L, S and D (Legacy, Storage and Document), each of them 64 bits wide, which gives 2^(64*3) = 2^192 as the maximum number of documents in one OpenLSD service. The size of each Storage can be up to 2^64 bytes, so each Legacy can hold up to 2^128 bytes. One OpenLSD service can therefore handle, in theory, up to 2^192 bytes and up to 2^192 documents. Every action is performed after receiving a message through the NIO socket support of the MINA framework.
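
The capacity figures above follow directly from the three 64-bit indexes. The sketch below is only an illustration of that arithmetic: the class name LsdId and its methods are hypothetical and are not part of the OpenLSD API, and it assumes the full unsigned 64-bit range is usable for each index.

```java
import java.math.BigInteger;

/** Hypothetical value type for the three indexes; not the actual OpenLSD API. */
final class LsdId {
    /** Number of distinct values of one 64-bit index: 2^64. */
    static final BigInteger INDEX_RANGE = BigInteger.ONE.shiftLeft(64);

    final long legacy;
    final long storage;
    final long document;

    LsdId(long legacy, long storage, long document) {
        this.legacy = legacy;
        this.storage = storage;
        this.document = document;
    }

    /** Upper bound on documents in one service: 2^64 * 2^64 * 2^64 = 2^192. */
    static BigInteger maxDocuments() {
        return INDEX_RANGE.pow(3);
    }

    /** Upper bound on bytes in one Legacy: 2^64 Storages of 2^64 bytes = 2^128. */
    static BigInteger maxBytesPerLegacy() {
        return INDEX_RANGE.pow(2);
    }

    /** Prints the indexes as unsigned numbers (assumption: all 64 bits are used). */
    @Override
    public String toString() {
        return Long.toUnsignedString(legacy) + "/"
                + Long.toUnsignedString(storage) + "/"
                + Long.toUnsignedString(document);
    }
}
```

Java's long is signed, so the sketch prints the values with Long.toUnsignedString to keep the whole 64-bit range visible.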

The server part has no direct relation with the business logic, and so it has no relation with the database. This is intended to allow an efficient and stable implementation that does not need to be modified, whatever business logic the clients implement. All indexes are handled by the clients, which use a database to guarantee unique indexes, to manage the storage capacity and to support business access and logic.

Three levels: Legacy, Storage, Document

Archiving in the OpenLSD server is organized in three levels:

Legacy – This is the highest level. Each Legacy can be considered as one business archive application. Each Legacy is composed of many Storages (up to 2^64 Storages per Legacy). For instance, emails and pictures can be archived in two different Legacies. As another example, if two groups of people want to use different storage and business logic, they can use two different Legacies.

Each Legacy has properties such as a specific crypto key (if any) used to encrypt documents on storage, the size of each Storage, and its status (open or closed). One OpenLSD Server can handle up to 2^64 Legacies.

Storage – This is the middle level. Each Storage should be considered as one storage space. Each Storage holds many Documents (up to 2^64 Documents per Storage). In general, a Storage is one filesystem, as large as a filesystem can be. Filesystem sizes of up to 2^64 bytes are allowed.

All Storages of one Legacy have the same size limit for the underlying filesystem. Used space is accounted from the document sizes rounded to the block size of the underlying filesystem implementation (4K by default, but this can be changed), but the accounting does not include the extra bytes added when a file is stored encrypted.

So the size defined in the Legacy must allow for the average size of the files that are expected to enter the system. Defining a size 5% smaller than the real size of the filesystem is usually sufficient.

For instance, if one filesystem is 16 terabytes in binary units (2^44 bytes), then the size set in the Legacy should be around 16 terabytes in decimal units (16*10^12 bytes).
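
The sizing rules above can be sketched as follows. This is a minimal illustration, not OpenLSD code: the class StorageSizing and its method names are hypothetical, and it reads the accounting "modulo the block size" as rounding each document up to a whole number of blocks, which is an assumption.

```java
/** Hypothetical helper illustrating the Storage sizing rules; not OpenLSD code. */
final class StorageSizing {
    /** Default block size mentioned in the text (4K). */
    static final long DEFAULT_BLOCK_SIZE = 4096L;

    /** Space accounted for one document: its size rounded up to whole blocks. */
    static long accountedSize(long documentBytes, long blockSize) {
        if (documentBytes < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long blocks = documentBytes / blockSize;
        if (documentBytes % blockSize != 0) {
            blocks++; // a partially filled block still occupies a full block
        }
        return blocks * blockSize;
    }

    /** Size to declare in the Legacy: the filesystem size minus a safety margin. */
    static long declaredSize(long filesystemBytes, int marginPercent) {
        if (marginPercent < 0 || marginPercent >= 100) {
            throw new IllegalArgumentException("margin must be in [0, 100)");
        }
        // Divide first so the multiplication cannot overflow for large filesystems.
        return filesystemBytes / 100 * (100 - marginPercent);
    }
}
```

With a 2^44-byte filesystem and a 5% margin, declaredSize returns a little under 16.8*10^12 bytes. Rounding further down to 16*10^12, as in the example above, simply leaves more room for encryption overhead.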

Document – This is the lowest level. Each Document can be of any type and any size (limited to 2^64 bytes, which cannot really be considered a limit for now ;-)).

Client on Business

The client part is the only one that knows the business properties (for instance, what to do with documents, who is allowed to see a specific document, how to find or import a specific document and with which index). The proposed framework client uses a database as permanent data storage for indexes and other properties.

OpenLSD is written using JDBC and was tested with PostgreSQL, MySQL and Oracle. Of course, the larger the number of documents in the system, the more robust the database must be. In our own tests, Oracle was the most stable and efficient. PostgreSQL should be close (at least for databases under 1 TB). Our tests with MySQL show that it is fine for small and medium offices, but not for the largest ones (millions of documents and thousands of final users). Of course, any specific business may reach its own conclusions!

In fact, there are several types of clients:

Importer or auto-importer: these clients know about the business, the database and how to communicate with the OpenLSD server (or even several servers, see later). An importer can work both locally (like usual archival software) and through the network.
Get (using Tomcat or a simple Java batch): these clients know about the business, the database (generally read only) and how to communicate with the OpenLSD server (or several servers) in read-only mode.
Admin: these clients handle specific functions such as starting or stopping some or all OpenLSD services.
Check: these clients perform checks such as consistency checks (between the database and OpenLSD) or connection checks.
Check Similar: this client checks whether a new document, not yet imported, is already present elsewhere in one specific Legacy, based on a binary comparison.
Advanced services: here we can find some new services, such as the planned cache function. The cache function would act on both reads and writes, enabling low-bandwidth networks to work correctly with the system. It will also allow database access to be restricted to one secure point instead of being distributed.

Security aspects

The security of the system is only partially handled by OpenLSD, through various protocol and technical measures. For example, user authentication is outside the scope of OpenLSD, because it is the responsibility of the final software to implement this kind of security. Here are the security points that OpenLSD takes into account:

Message passing uses a proprietary protocol so that final users never get direct access (read or write) to the documents. The physical storage is handled entirely by a single (multithreaded) Java process, which ensures that every access (read or write) goes through the OpenLSD protocol. Therefore, each access to the documents is controlled by the OpenLSD system.
Documents can be stored encrypted. The key belongs to the Legacy and can be different for each Legacy. This encryption ensures that someone who gains access to the physical files will not be able to read the documents directly.
An MD5 digest is computed for each document and stored in the database. This makes it possible to verify that the document inside the system is still the same as the one that was originally inserted. Of course, this check also depends on the security of the database.
Security of storage (replication to protect against physical corruption) can be achieved in several ways:
- A physical mirror between storage units (for example an asynchronous mirror between disk storages): the main advantage of this solution is the ease of implementation, the main disadvantage is the price (of the network link or of the software part of the disk storage). The propagation of user mistakes (such as deleting a directory, which implies the same deletion on the mirror) should be a minor risk, as it is unlikely that a user can access the disk storage directly, let alone with the ability to delete anything. Generally, this kind of mirror is limited to one copy only.
- An application mirror between two instances of the same application: the main advantage is that the application is fully responsible for the replication and therefore for the consistency, the main disadvantage is the potentially high latency introduced in the application in order to validate the insertion of the document in both instances. If the main application wants to trace everything, this should be the best choice, since the application is responsible for everything and so knows what it does.
- An OpenLSD mirror between two instances of the same Legacy: the main advantage is that the application does not have to take care of the replication, since OpenLSD handles it, the main disadvantage is a potential replication gap, since OpenLSD acknowledges the insertion as soon as the document is inserted in one of the existing Legacy instances, which means that the replication is still in progress and not yet finished. If tracing the replication process is not mandatory, this should be the best choice.

One of the advantages of a mirror is that it lets the main applications access documents through the closest available repository. OpenLSD is able to handle up to 2^64 replication hosts.

Some functions use specific security techniques such as key or password checking.
Even if legal people tend to say that everything should be kept in the archive, so that in theory no document is ever deleted from the system, with very large archive databases we cannot assume that everything will be stored for eternity (at least with current technology). So we implemented a solution to delete a document when its lifetime is over, but in a secure way (using key or password checking, TCP/IP security such as filtering, and MD5 double checking).
The database must also be taken into account in the security aspects. Your database administrator should help you improve security and reliability, for example by using database replication, whichever way you choose. For OpenLSD, the necessary rights are very simple (select, insert, delete, update rows, truncate table - on one temporary table only - and execute procedures on the OpenLSD schema).
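
The MD5 fingerprint listed among the security points above can be computed with the standard Java MessageDigest class. The sketch below is a minimal illustration: the class Md5Fingerprint and its method names are hypothetical, and how OpenLSD itself computes and stores the digest is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper computing an MD5 fingerprint; not OpenLSD code. */
final class Md5Fingerprint {
    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Fingerprint of a document held in memory, as 32 lowercase hex digits. */
    static String of(byte[] content) {
        return toHex(newMd5().digest(content));
    }

    /** Fingerprint of a document read as a stream, without loading it whole. */
    static String of(InputStream in) throws IOException {
        MessageDigest md = newMd5();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        return toHex(md.digest());
    }

    /** True if the content still matches the fingerprint stored at import time. */
    static boolean unchanged(byte[] content, String storedHex) {
        return of(content).equalsIgnoreCase(storedHex);
    }
}
```

The streaming variant matters for an archive, since documents may be far larger than the available memory.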

Secure and efficient document deletion support

Regarding the last point (deleting documents): in the early 90's, deleting a document from an archive system was considered unacceptable. But now, with huge systems, most people cannot afford to store everything forever. This does not mean that keeping a document forever in OpenLSD is impossible. In fact, this is the default behavior: a document is never deleted from the system unless a specific action is taken. This specific action is based on a triple check: first a TCP/IP filter can be set, then a key or password protection, and finally the given MD5 is checked against the real MD5 of the file. Once everything is OK, the file is deleted.
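
The triple check described above can be sketched as a small guard object. This is only an illustration under assumptions: the class DeleteGuard, its parameters, and the representation of the TCP/IP filter as a set of allowed host addresses are all hypothetical, and the real OpenLSD protocol is not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;

/** Hypothetical guard applying the three checks before a delete; not OpenLSD code. */
final class DeleteGuard {
    private final Set<String> allowedHosts; // stands in for the TCP/IP filter
    private final byte[] password;

    DeleteGuard(Set<String> allowedHosts, String password) {
        this.allowedHosts = Set.copyOf(allowedHosts);
        this.password = password.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Returns true only if all three checks pass, in the order given in the text:
     * network filter, then password, then MD5 of the stored file.
     */
    boolean mayDelete(String remoteHost, String givenPassword,
                      String givenMd5Hex, String actualMd5Hex) {
        if (!allowedHosts.contains(remoteHost)) {
            return false; // check 1: the request comes from an unexpected host
        }
        byte[] given = givenPassword.getBytes(StandardCharsets.UTF_8);
        if (!MessageDigest.isEqual(password, given)) {
            return false; // check 2: wrong key or password
        }
        // check 3: the caller must know the MD5 of the file it wants to delete
        return actualMd5Hex.equalsIgnoreCase(givenMd5Hex);
    }
}
```

The third check means that a caller cannot delete a document by index alone: it has to present the fingerprint recorded for that exact content.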

One may ask: why all of this just to delete a file?

Well, if you recall, we take huge numbers of documents as our example. What if the system brings in a huge number of documents every day and takes out (deletes) a comparable number, for instance in an archive where documents live for 6 months? All archival systems from the 90's would have one big problem: their internal index could not sustain a long system lifetime, since index values only ever increase. To be more explicit, take the following numbers as an example:

2^24 documents enter the system each day (about 16 million a day, about 700,000 per hour)
Documents are deleted after 6 months, so there are at most 2^24 * 6 * 30, i.e. around 3 billion, documents in the system.
The internal index is 32 bits wide, so it can address more than 4 billion documents, which is more than the number of documents that should be stored.
But after 2^8 days (256 days, so less than one year), the index reaches its maximum value (2^24 * 2^8 = 2^32), so the system is KO.

So what OpenLSD proposes is to keep track of the deleted indexes and to reuse them when a new document is inserted.
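
The difference between an ever-increasing index and a reusable one can be sketched as follows. This is a minimal illustration under assumptions: the class ReusableIndexAllocator is hypothetical and keeps the freed indexes in memory, whereas the text states that OpenLSD clients manage indexes through a database.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical allocator that reuses freed index values; not OpenLSD code. */
final class ReusableIndexAllocator {
    private final long capacity;                       // number of distinct index values
    private long next = 0;                             // next never-used index
    private final Deque<Long> freed = new ArrayDeque<>();

    ReusableIndexAllocator(long capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Days before an index that is never reused runs out of values. */
    static long daysUntilExhaustion(int indexBits, long documentsPerDay) {
        return (1L << indexBits) / documentsPerDay;
    }

    /** Hands out a freed index if one exists, otherwise a fresh one. */
    long allocate() {
        Long reused = freed.pollFirst();
        if (reused != null) {
            return reused;
        }
        if (next >= capacity) {
            throw new IllegalStateException("index space exhausted");
        }
        return next++;
    }

    /** Records that the document holding this index has been deleted. */
    void release(long index) {
        freed.addLast(index);
    }
}
```

With the numbers above, daysUntilExhaustion(32, 1L << 24) gives 256 days for an index that is never reused, while an allocator that reuses freed values only needs room for the roughly 3 billion documents alive at any one time.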

Keep storage usage low

A side effect, as important as the first goal of handling indexes correctly when documents are deleted, is that deleting a document frees some storage space. OpenLSD tries to fill the freed space as much as possible before asking for a new storage, so as to keep your investment in disk storage as low as possible.
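
One simple way to fill freed space before opening a new storage is a first-fit search over the existing Storages. The sketch below assumes that policy purely for illustration: the class StoragePicker is hypothetical, and the actual placement strategy used by OpenLSD is not described in this document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical first-fit placement over Storages; not the OpenLSD strategy. */
final class StoragePicker {
    private final long storageCapacity;                 // same limit for every Storage
    private final List<Long> used = new ArrayList<>();  // bytes used, by Storage number

    StoragePicker(long storageCapacity) {
        this.storageCapacity = storageCapacity;
    }

    /** Places a document in the first Storage with room, opening one only if needed. */
    int place(long size) {
        if (size < 0 || size > storageCapacity) {
            throw new IllegalArgumentException("document does not fit in any Storage");
        }
        for (int i = 0; i < used.size(); i++) {
            if (storageCapacity - used.get(i) >= size) {
                used.set(i, used.get(i) + size);
                return i; // reuse free space in an existing Storage
            }
        }
        used.add(size);   // no room anywhere: open a new Storage
        return used.size() - 1;
    }

    /** Frees the space of a deleted document so later inserts can reuse it. */
    void remove(int storage, long size) {
        used.set(storage, used.get(storage) - size);
    }

    int storageCount() {
        return used.size();
    }
}
```

In this sketch, a delete followed by an insert of similar size leaves the number of Storages unchanged, which is the effect the paragraph above describes.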