People of a certain age, or with an interest in science and science fiction, will remember both the cuteness and the horror of machines that start to look and think like humans. There was K-9, Doctor Who’s faithful robot dog, who sacrificed himself for the Doctor. R2-D2 and his robot companion C-3PO argued and bickered all the way through Star Wars, cementing the idea that machines can have likeable personalities.
The Matrix, with its dominant machines using humans to feed their ‘civilisation’, and the father of them all, the chilling HAL in Kubrick’s 2001: A Space Odyssey, frightened us with the battle for supremacy. We want menial, tedious, or huge but repetitive tasks done by machines, but under our control.
Talking of IBM, its Deep Blue computer finally beat Garry Kasparov at chess in 1997, having narrowly lost in 1996. Its ability to analyse 200 million potential positions a second speaks well of both Kasparov and, of course, the IBM programmers.
All the above cases are either fiction (although with some chilling insights into future technology: remember the shopping mall in Minority Report?) or the result of huge investment in specific areas. With everyday data we have a growing issue: the size and complexity of today’s data volumes, added to the problem of all those years of legacy data. It is an almost insurmountable problem if you are looking to implement a document handling system (EDRMS).
Take any company and assume it has 10GB of storage; that probably means tens of millions of documents. Document handling systems are only as good as the taxonomy they are provided with, and the skill with which the owners have placed the existing and current documents into the correct categories.
The company wants to sort its historical documents into the categories. What does it do? Hire temps or interns, or even get the departments to undertake the work themselves? Or use a system like Apperception’s DataCube, which automatically places each document into its correct category or folder within the taxonomy? DataCube uses LSI-based Conceptual Searching to cluster documents with similar content into the appropriate folders, matching not just on words or strings but on concepts.
Well, let’s look at the alternatives…
The manual approach is actually unrealistic, but for comparison purposes a guide to the costs can be taken from a research article (Oard et al. 2008), which reported that in a scientifically supervised trial a team of undergraduate and graduate law students sorted an average of 21.5 documents per hour, at a categorisation accuracy of only 55.5%. So manually sorting 50,000 documents would take 50,000 / 21.5 = 2,325 hours, which at 6 hours per day is 387 days, or nearly 2 man-years (with an almost 1 in 2 error rate).
Multiply that by 10 for 500,000 documents and you have 20 man-years. Hmmm. Not going to get much career lift out of that!
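For anyone who wants to plug in their own document counts, the estimate above can be reproduced with a few lines of Python. This is only a rough sketch: the sorting rate, accuracy and 6-hour day are the figures quoted above, while the 220 working days per year and the function name are assumptions added for illustration.

```python
# Figures quoted in the article (Oard et al. 2008 trial)
DOCS_PER_HOUR = 21.5   # average manual sorting rate
ACCURACY = 0.555       # share of documents placed in the right category
HOURS_PER_DAY = 6      # productive hours per working day

# Assumption for illustration only, not from the article
WORKING_DAYS_PER_YEAR = 220


def manual_sort_estimate(n_docs):
    """Return a rough effort estimate for sorting n_docs by hand."""
    hours = n_docs / DOCS_PER_HOUR
    days = hours / HOURS_PER_DAY
    return {
        "hours": hours,
        "days": days,
        "person_years": days / WORKING_DAYS_PER_YEAR,
        "misfiled": round(n_docs * (1 - ACCURACY)),
    }


if __name__ == "__main__":
    for n in (50_000, 500_000):
        est = manual_sort_estimate(n)
        print(
            f"{n:,} documents: {est['hours']:,.0f} hours, "
            f"{est['days']:,.0f} days, "
            f"{est['person_years']:.1f} person-years, "
            f"about {est['misfiled']:,} misfiled"
        )
```

With 220 working days in a year the 50,000-document case comes out a little under the "nearly 2 man-years" quoted above, so treat the round numbers in the text as an upper-end guide.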
So before you get into choosing your EDRMS, think about two things:
Does your taxonomy fit your business?
How are you going to populate it accurately without using the Apperception DataCube?