OpenLSD

OpenLSD : Document Archiving : Open Legacy Storage Document


OpenLSD is a framework for Enterprise Content Management (ECM, or GEIDE in French) whose goal is digital archiving. It is meant to be generic: it should not depend on the archived data, on their format, or on their index. The project is therefore not a complete solution by itself, but a complete framework in which the business side is defined only by specifying the type of index and creating the associated retrieval application. It is a framework for archiving and retrieving large volumes of static documents.

This project has the following properties:

100% pure Java; it uses the MINA framework for NIO support
In theory, handles up to 2^192 documents, split into 2^64 logical application spaces (Legacy), each split into 2^64 storage spaces (Storage), each containing up to 2^64 documents (Document)
Each storage space (Storage) can reach up to 2^64 bytes (about 16 000 petabytes, or 16 million terabytes, or 16 billion GB)
In practice, a physical server should not have more than 256 storages (filesystems) for management reasons (talk to your SAN and system administrators to see what they think), so each physical server can handle up to 2^72 documents (4 722 billion billion documents). The number of documents in each storage depends on the filesystem limits and the size of the documents. Generally, a filesystem is limited to around 2^54 bytes; with an average of 256 KB per document, that gives up to 2^36 documents in each storage/filesystem, so up to 2^44 documents in one OpenLSD physical server (17 592 billion documents). Beyond that, the number of physical servers is limited only by the money you can invest. Say you can afford 32 such servers: you can then have 2^49 documents (562 949 billion documents).
Allows multiple copies (from 1 to n) to be created at distinct locations, to ensure both data security (replication) and fast access to the documents in terms of network latency (closeness)
Allows the use of caches (for both reads and writes) to speed up reading a document for an end user, and also to speed up importing documents into the system when the user (or the application) is not close to one of the LSD locations
Uses a database through JDBC: MySQL for a small CRM of up to a few million entries, Oracle or PostgreSQL for a bigger CRM
Allows a Java module in Tomcat (or another servlet container) to implement the reading interface for the documents held in the CRM system
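
The capacity figures in the list above all follow from powers of two. As a minimal sketch (the class and method names are illustrative and not part of OpenLSD), the arithmetic can be reproduced with exact integers:

```java
import java.math.BigInteger;

/** Illustrative capacity arithmetic; names here are hypothetical, not OpenLSD API. */
final class CapacitySketch {

    /** Returns 2^exponent as an exact integer. */
    static BigInteger pow2(int exponent) {
        return BigInteger.ONE.shiftLeft(exponent);
    }

    /** Documents per storage for a filesystem of 2^fsBits bytes and documents of 2^docBits bytes. */
    static BigInteger documentsPerStorage(int fsBits, int docBits) {
        return pow2(fsBits - docBits);
    }

    public static void main(String[] args) {
        // Theoretical address space: 2^64 legacies x 2^64 storages x 2^64 documents.
        System.out.println("theoretical total      = " + pow2(64 + 64 + 64));
        // 256 storages (2^8) per server, each addressing up to 2^64 documents.
        System.out.println("theoretical per server = " + pow2(8 + 64));

        // Practical case: 2^54-byte filesystem, 256 KB (2^18 bytes) average document.
        BigInteger perStorage = documentsPerStorage(54, 18);
        BigInteger perServer = perStorage.multiply(pow2(8));
        BigInteger for32Servers = perServer.multiply(BigInteger.valueOf(32));
        System.out.println("practical per storage  = " + perStorage);
        System.out.println("practical per server   = " + perServer);
        System.out.println("practical, 32 servers  = " + for32Servers);
    }
}
```

The practical figures are only as good as the two assumptions taken from the text: a filesystem limit of about 2^54 bytes and an average document of 256 KB.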

OpenLSD is not self-sufficient. It does not take into account the business specifics (business data) used to index documents, but it provides the functions needed to handle them with a little extra coding. OpenLSD is a framework that must be extended (an example is given in the source) to obtain the application the user wants. Mainly, one Java class needs to be written. OpenLSD provides the following functions:

Creation of the index and data structures that OpenLSD needs, in several database tables (MySQL, PostgreSQL, Oracle; porting to another database should not take more than 2 days)
Once the business logic is implemented in the database and the external software is adapted (which should take 5 to 10 days), the system is ready
Import of documents: from the server that provides the physical storage (faster) or from the network (the speed depends on the network). The import is secured (controlled, and duplicated when in clone mode, that is, with multiple legacies), validated (using an MD5-like algorithm), unique (the identification) and optimized (storage and network)
Extraction of documents: from the server that provides the physical storage (for instance, to produce an export for burning onto external media) or from the network (Java client, or J2EE client from Tomcat or equivalent)
Integrity Control and Repair functions
Statistical Functions
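
To illustrate what validating an import with an MD5-like algorithm can look like, here is a minimal sketch using the standard java.security.MessageDigest class. It is not OpenLSD's actual code: the class name, the method names and the idea of a digest "announced" by the sender are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical import-side digest check; not taken from the OpenLSD sources. */
final class ImportDigestSketch {

    /** Reads the stream to its end and returns its MD5 digest as lowercase hex. */
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available in this JVM", e);
        }
        byte[] buffer = new byte[8192];
        int read;
        // Feed the digest chunk by chunk so large documents never sit fully in memory.
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** True when the received bytes match the digest announced by the sender (assumed protocol). */
    static boolean isValidImport(InputStream received, String announcedHex) throws IOException {
        return md5Hex(received).equalsIgnoreCase(announcedHex);
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "hello".getBytes(StandardCharsets.UTF_8);
        String announced = md5Hex(new ByteArrayInputStream(document));
        System.out.println("digest = " + announced);

        byte[] corrupted = "hellp".getBytes(StandardCharsets.UTF_8);
        System.out.println("intact copy accepted    = "
                + isValidImport(new ByteArrayInputStream(document), announced));
        System.out.println("corrupted copy accepted = "
                + isValidImport(new ByteArrayInputStream(corrupted), announced));
    }
}
```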

More detailed information can be found here (Concept, Howto, API, download).

Benchmarks (more recent benchmarks can be found here)

The following tests were run in 2007:

Import Tests

Between 2 servers (Intel, 2 processors, IBM JDK 1.5), we reach a bandwidth of 80 Mb/s on a 100 Mb/s network link when importing documents (100 KB per document, so 100 documents per second or 360 000 documents per hour), but without database persistence.
On one AIX server with 8 processors and IBM JDK 1.5 64-bit, we reach 1420 documents of 10 KB per second inserted into the system in non-encrypted mode with database persistence (115 Mb/s). In encrypted mode, with 8 processors, 1290 documents per second are reached.
On one AIX server with 2 processors and IBM JDK 1.5 64-bit, we reach 170 documents of 100 KB per second inserted into the system in non-encrypted mode with database persistence (136 Mb/s); the same rate is reached in encrypted mode, but with 3 processors.
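
The bandwidth figures above are the document rate multiplied by the document size. A minimal sketch of that conversion follows; the class and method names are illustrative, and decimal units (1 KB = 1000 bytes, 1 Mb = 10^6 bits) are an assumption, since the text does not say which convention was used.

```java
/** Illustrative rate-to-bandwidth conversion; names and unit convention are assumptions. */
final class ThroughputSketch {

    /** Megabits per second for a document rate and a document size in KB (decimal units). */
    static double megabitsPerSecond(double documentsPerSecond, double documentKilobytes) {
        // KB -> kilobits (x 8), then kilobits -> megabits (/ 1000).
        return documentsPerSecond * documentKilobytes * 8.0 / 1000.0;
    }

    /** Documents per hour for a given rate per second. */
    static long documentsPerHour(long documentsPerSecond) {
        return documentsPerSecond * 3600L;
    }

    public static void main(String[] args) {
        System.out.println("100 docs/s x 100 KB -> " + megabitsPerSecond(100, 100) + " Mb/s");
        System.out.println("1420 docs/s x 10 KB -> " + megabitsPerSecond(1420, 10) + " Mb/s");
        System.out.println("170 docs/s x 100 KB -> " + megabitsPerSecond(170, 100) + " Mb/s");
        System.out.println("100 docs/s          -> " + documentsPerHour(100) + " docs/hour");
    }
}
```

Under this convention the 8-processor case comes out near 114 Mb/s, slightly below the reported 115 Mb/s; the small gap depends on how KB and Mb were counted in the original measurement.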

Web Retrieve Tests

From one Intel server with 2 processors, IBM JDK 1.5 64-bit and Tomcat 5.5, we reach 120 documents (of 100 KB) per second in consultation (retrieval from the web) with a response time of 0.2 second per document, so a measured bandwidth of 96 Mb/s on a 100 Mb/s network link. The Intel server was at 60% CPU, the AIX server at 20% on 1 CPU.
In a second test identical to the previous one (consultation) but with 10 KB documents in encrypted mode in OpenLSD, we reach 350 requests per second, 0.012 second per document, and a measured network bandwidth of 30 Mb/s. The Intel server was at 90% CPU, the AIX server at 40% on 1 CPU.
In the same test in non-encrypted mode, we reach 350 requests per second, 0.012 second per document and a measured network bandwidth of 30 Mb/s. The Intel server was at 90% CPU, the AIX server at 20% on 1 CPU.
In the same test in non-encrypted mode but with two identical Tomcat servers, we reach 700 requests per second, 0.012 second per document and a measured network bandwidth of 60 Mb/s. The Intel servers were at 90% CPU, the AIX server at 40% on 1 CPU.
In the same test in non-encrypted mode but with three identical Tomcat servers, we reach 1000 requests per second, 0.012 second per document and a measured network bandwidth of 88 Mb/s. The Intel servers were at 90% CPU, the AIX server at 90% on 1 CPU.
In the same test in encrypted mode with three identical Tomcat servers, we reach 1000 requests per second, 0.012 second per document and a measured network bandwidth of 88 Mb/s. The Intel servers were at 90% CPU, the AIX server at 75% on 2 CPUs.

Simultaneous Import and Web Retrieve Tests

When simultaneously inserting documents and retrieving documents through the web (10 KB documents in both cases), the same performance is reached with 3 Tomcat servers, in either encrypted or non-encrypted mode: 1000 requests per second, 0.012 second per document and a measured network bandwidth of 88 Mb/s, with up to 400 documents of 10 KB per second inserted into the system. The AIX server uses 5 CPUs.

Consistency Check Tests

Validating files inside the system (MD5 consistency between files and DB) ran at 2400 files validated per second, so a bandwidth of 470 Mb/s of tested bytes, using 4 CPUs.
Validating files inside the system (existence consistency between files and DB, without the MD5 test) ran at 30 000 files validated per second, using 8 CPUs.
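
The two check modes above differ in how much work is done per file: the existence check only looks the file up, while the MD5 check also reads and hashes it. The sketch below illustrates that difference under loud assumptions: an in-memory map stands in for the database index, and the class and method names are hypothetical, not OpenLSD's API.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical consistency check; a Map stands in for the database index. */
final class ConsistencyCheckSketch {

    /** MD5 digest of the given bytes as 32 lowercase hex characters. */
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available in this JVM", e);
        }
    }

    /**
     * Returns the indexed paths that fail the check. With verifyDigest false only
     * existence is tested; with verifyDigest true the content is hashed as well.
     */
    static List<Path> findInconsistent(Map<Path, String> index, boolean verifyDigest)
            throws IOException {
        List<Path> failures = new ArrayList<>();
        for (Map.Entry<Path, String> entry : index.entrySet()) {
            Path file = entry.getKey();
            if (!Files.isRegularFile(file)) {
                failures.add(file); // indexed but absent from storage
            } else if (verifyDigest
                    && !md5Hex(Files.readAllBytes(file)).equals(entry.getValue())) {
                failures.add(file); // present, but content no longer matches the index
            }
        }
        return failures;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lsd-check");
        Path intact = Files.write(dir.resolve("intact.bin"), new byte[] {1, 2, 3});
        Path altered = Files.write(dir.resolve("altered.bin"), new byte[] {4, 5, 6});
        Path missing = dir.resolve("missing.bin");

        Map<Path, String> index = new LinkedHashMap<>();
        index.put(intact, md5Hex(new byte[] {1, 2, 3}));
        index.put(altered, md5Hex(new byte[] {9, 9, 9})); // digest of different content
        index.put(missing, md5Hex(new byte[0]));

        System.out.println("existence only: " + findInconsistent(index, false));
        System.out.println("with MD5:       " + findInconsistent(index, true));
    }
}
```

The existence-only mode never opens the files, which is consistent with it running an order of magnitude faster than the MD5 mode in the measurements.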

Test	Servers	Results	Physical	Information
Check (database / files consistency) with MD5	AIX 5.3, 4 CPU P5	2400 files/second or 50 MB/s	52 MB/s (400 Mb/s) on a 2 Gb SAN, 16 Mb/s on a Gb network	200 million/day, limited by SAN bandwidth
Check (database / files consistency) without MD5	AIX 5.3, 8 CPU P5	30 000 files/second	80 MB/s on SAN (640 Mb/s)	2.6 billion/day, limited by SAN bandwidth
File import (10 KB)	AIX 5.3, 8 CPU P5	1420 files/second, 115 Mb/s, 0.7 ms/file	28 MB/s (224 Mb/s) on a 2 Gb SAN, 11.2 Mb/s on a Gb network	On 40 000 files with 8 imports in parallel
File import (10 KB)	AIX 5.3, 3 CPU P5	425 files/second, 18 Mb/s, 2.3 ms/file	7 MB/s (56 Mb/s) on a 2 Gb SAN, 6.4 Mb/s on a Gb network	On 5 000 files with one import
File import of 10 KB (encrypted or not)	AIX 5.3, 1 CPU P5	1 single file in 2.8 seconds		JVM launch time and necessary connections
File import (10 KB) in encrypted mode	AIX 5.3, 8 CPU P5	1290 files/second, 103 Mb/s, 0.8 ms/file	25 MB/s (200 Mb/s) on a 2 Gb SAN, 10 Mb/s on a Gb network	10% overhead observed with encryption on most benchmarks
Web retrieve (10 KB) (encrypted or not)	AIX 5.3, 2 CPU P5 + 1/2/3 Blade Center bi-processor	350/700/1000 files/second, 88 Mb/s, 12 ms/file		CPU use on the LSD server (AIX) is doubled in encrypted mode
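
The per-file times and per-day volumes in the table can be derived from the files-per-second column. A minimal sketch of that derivation, with hypothetical class and method names and assuming the rate is sustained over a full 24-hour day:

```java
/** Illustrative projections from a files-per-second rate; names are hypothetical. */
final class RateProjectionSketch {

    private static final long SECONDS_PER_DAY = 24L * 60L * 60L;

    /** Files processed in 24 hours at a constant rate. */
    static long filesPerDay(long filesPerSecond) {
        return filesPerSecond * SECONDS_PER_DAY;
    }

    /** Average time spent per file, in milliseconds. */
    static double millisPerFile(double filesPerSecond) {
        return 1000.0 / filesPerSecond;
    }

    public static void main(String[] args) {
        System.out.println("check with MD5:    " + filesPerDay(2400) + " files/day");
        System.out.println("check without MD5: " + filesPerDay(30000) + " files/day");
        System.out.printf("import, 8 CPUs:    %.2f ms/file%n", millisPerFile(1420));
        System.out.printf("import, 3 CPUs:    %.2f ms/file%n", millisPerFile(425));
    }
}
```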