One of the biggest projects I work on concerns Electronic Document Management, or more precisely document archiving rather than content management. My employer aims to store about 2 PB of documents (2,000 TB, or 2 million GB), that is, more than 200 billion documents with an average size of 10 KB per document.
This project was assigned to my team in 2004, with the help of two people, notably Mr Vincent Castella as project lead within my team. IBM also contributed greatly to our studies.
My employer had chosen a (proprietary) software product at the beginning of 2002.
During our studies and development, it appeared that this solution has several limitations, as do its competitors. Among them:
- There is a limit on the storage space, i.e. how many TB can be managed: most of these products were created during the 90's, when storage rarely went beyond 1 terabyte. For our project we have to consider a factor of at least a thousand.
- There is a limit on how many files can be stored in one storage space: again due to the 1 TB limit on one side, and to the 32-bit programming model of the 90's on the other, the number of documents is limited to a few billion. Here again our project faces a factor of at least a hundred, easily tending toward a thousand.
- There is a real weakness in securing documents through replication across two sites: in the 90's, data safety could be ensured using tapes or slow on-line replication. Today, given the number of documents to save (we estimate about 1 million documents per hour under heavy load), the methods of the 90's are no longer adequate. The replication speed we measured is about 4,000 documents per hour, roughly a factor of 250 below our target, and tape capacity is (to date) not compatible with the required write and read speeds.
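To make the replication gap concrete, here is the arithmetic behind it, using the load estimate and the measured rate quoted above (the exact ratio works out to 250, i.e. between two and three orders of magnitude):

```java
// Back-of-the-envelope check of the replication gap, using the
// figures estimated in the text (1,000,000 documents per hour
// required under heavy load, 4,000 per hour measured).
public class ReplicationGap {

    // How many times too slow the measured replication rate is.
    public static long gapFactor(long requiredPerHour, long measuredPerHour) {
        return requiredPerHour / measuredPerHour;
    }

    public static void main(String[] args) {
        System.out.println("Gap factor: " + gapFactor(1000000L, 4000L)); // prints 250
    }
}
```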
Moreover, we face another problem concerning file transfer. Indeed, to ingest this million documents per hour, over 100,000 files must come in by file transfer every hour (each file containing about 10 documents). However, secure file transfer products cannot guarantee such volumes, again because they were designed in the 90's.
We cannot use FTP, since it is secure neither in terms of encryption nor in terms of transfer quality.
SFTP, the file transfer protocol over SSH, meets the first criterion (encryption) but not the transfer-quality ones: guaranteeing that a transfer completed correctly; restarting a failed transfer from the beginning or from a checkpoint; knowing what kind of transfer took place, from whom and when; triggering automatic pre- or post-transfer actions...
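The restart-from-checkpoint requirement can be sketched in a few lines of Java: the receiver reports how many bytes of the file it already holds, and the sender resumes from that offset instead of resending from byte zero. This is only an illustration of the idea, not any actual product's protocol; the class and method names are mine.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Minimal sketch of restart-from-checkpoint (illustrative only).
// After a failed transfer, the checkpoint is simply the number of
// bytes already received; the sender seeks to that offset and
// transfers only the remaining bytes.
public class CheckpointResume {

    // The receiver's checkpoint: the size of the partial file, if any.
    public static long checkpoint(File partial) {
        return partial.exists() ? partial.length() : 0L;
    }

    // Copy the remaining bytes of 'source' into 'partial', starting at 'offset'.
    public static void resume(File source, File partial, long offset) throws IOException {
        RandomAccessFile in = new RandomAccessFile(source, "r");
        RandomAccessFile out = new RandomAccessFile(partial, "rw");
        try {
            in.seek(offset);   // skip what was already sent
            out.seek(offset);  // append after what was already received
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
```

A real monitor would of course also verify integrity (e.g. a checksum of the received part) before trusting the checkpoint, and record who transferred what and when.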
These are the kinds of functions provided by file transfer monitors such as the CFT software. But those products suffer when too many transfers enter the system.
Focusing on these two problems, I decided to develop two systems, quite close in design but very different in usage, entirely in Java (1.5), and I plan to develop a third, derived from the first, specifically to handle legal email archiving:
- OpenLSD: document archiving software (Open Legacy Storage Document)
- OpenLSM: email archiving software (Open Storage Mail)
- OpenR66: file transfer monitoring software (Open Route 66)