Taxonomy Creation and Auto Classification
The rise of content and document management systems has brought both order and chaos to some organisations. Many now find themselves running multiple systems with different taxonomies, and the time and effort needed to achieve a single system (or at least a common taxonomy) has so far proved prohibitive.
The organisational or categorical structure, often referred to as a taxonomy, needs to be well defined so that the user or automated technology knows the characteristics a document must have to be classified accordingly.
The DataCube provides two methods of helping create schemas or taxonomies. The first is the Supervised Categorisation method, where DataCube takes an existing schema and amends it to suit the client organisation (for example, a personalised LGCS), then populates it with exemplars for each category and subcategory. This enables a categorisation exercise to be undertaken within a personalised schema.
The DataCube provides a Schema Management module as the basis of the automated system. Each category is defined and the attributes and properties described, including default properties for the category e.g. Retention Period. Exemplars are then associated with this category, which provide best and worst examples of documents that fit into this category. Schemas can be created to best fit the organisation’s needs and can be bespoke.
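As an illustration only, a category of this kind could be modelled as below. The class and field names (`Category`, `Exemplar`, `retention_years`) and the rule that a subcategory inherits its parent's retention period are assumptions made for this sketch, not a description of DataCube's own data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Exemplar:
    """A sample document attached to a category (hypothetical structure)."""
    text: str
    is_positive: bool = True  # True = a "best" example, False = a "worst" example


@dataclass
class Category:
    """One node of the taxonomy, with a default property and its exemplars."""
    name: str
    retention_years: Optional[int] = None  # default property, e.g. Retention Period
    exemplars: List[Exemplar] = field(default_factory=list)
    subcategories: Dict[str, "Category"] = field(default_factory=dict)

    def add_subcategory(self, child: "Category") -> "Category":
        # Assumption: a subcategory inherits the parent's retention period
        # unless it defines its own.
        if child.retention_years is None:
            child.retention_years = self.retention_years
        self.subcategories[child.name] = child
        return child

    def walk(self, prefix: str = ""):
        """Yield (path, category) for this node and every descendant."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path, self
        for child in self.subcategories.values():
            yield from child.walk(path)


# Illustrative category names only; not taken from any published scheme.
finance = Category("Finance", retention_years=7)
invoices = finance.add_subcategory(Category("Invoices"))
invoices.exemplars.append(Exemplar("Invoice number 1042 for payment of services"))
paths = [path for path, _ in finance.walk()]
```

Keeping exemplars on the category itself means a bespoke schema can be built by editing the tree and its examples rather than any matching logic.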
Often, industry-standard schemas are used. For example, DataCube has used the Local Government Classification Schema for UK Local Authorities, which defines 750+ categories covering all activities within their business. We have defined this taxonomy in detail and populated it with over 15,000 anonymised exemplars. An index and categorisation system is therefore ready to deploy into any UK Local Authority, while still allowing modifications and enhancements to reflect the differences of each authority.
The DataCube scans and identifies all of the files in the network and creates a data inventory of every relevant data file, containing details about the file including its file path. It then uses this information to collect text from every file and build an index. Finally, it compares the text from every document against all of the exemplars to find the best correlation, and marks the document with the category of the exemplar it has matched. This process has been extensively tested on Local Authority data over millions of documents and found to be over 85% accurate; with further retraining of the system, it is expected to achieve over 90%. This compares with manual labelling, which is believed to be less accurate (50%) and, given the volume of legacy data, impossible to carry out.
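The final matching step can be sketched in a few lines. This is a simplified stand-in, assuming plain word counts and cosine similarity as the measure of "best correlation"; the function names and sample exemplars are hypothetical, and DataCube's own conceptual matching is not documented here.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def vectorize(text: str) -> Counter:
    """Turn text into a bag-of-words vector of lower-case word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(document: str, exemplars: Dict[str, List[str]]) -> Tuple[str, float]:
    """Return the category of the best-matching exemplar and its score."""
    doc_vec = vectorize(document)
    best_category, best_score = "", 0.0
    for category, texts in exemplars.items():
        for text in texts:
            score = cosine(doc_vec, vectorize(text))
            if score > best_score:
                best_category, best_score = category, score
    return best_category, best_score


# Hypothetical exemplars, one per category, for illustration.
exemplars = {
    "Planning": ["planning application for a house extension and building consent"],
    "Council Tax": ["council tax bill payment reminder for the property band"],
}
category, score = classify("Reminder: your council tax payment is overdue", exemplars)
```

Here the sample document shares four words with the Council Tax exemplar and none with the Planning one, so it is marked against Council Tax. A document that matches no exemplar comes back with an empty category and a score of zero, which a real deployment would presumably route for review.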
26% of companies have four or more EDRMS systems (source: AIIM)
61% of companies say they have ‘sub-optimal’ SharePoint deployments
The second methodology is the ‘Unsupervised Mode’. In this mode, the DataCube scans all of the organisation’s data and groups the documents into a specified number of clusters of documents that are conceptually related to each other. This process (known as Dynamic Clustering) groups documents by their conceptual similarity to each other, without using example documents to establish the conceptual basis for each cluster. This is very useful when dealing with an unknown collection of unstructured text. The DataCube can group the documents into as many categories as the user specifies and even suggests a naming convention. In this way, the taxonomy reflects the actual data rather than a proposed taxonomy, which often distributes documents unevenly across its categories or topics.
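A minimal sketch of grouping documents into a user-specified number of clusters follows. It assumes word-count vectors and a simple agglomerative strategy (repeatedly merging the two most similar clusters); this is a swapped-in textbook technique rather than DataCube's Dynamic Clustering algorithm, and the stop-word list, function names and sample documents are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import List

STOP_WORDS = {"a", "and", "for", "of", "the"}  # tiny illustrative list


def vectorize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(documents: List[str], k: int) -> List[List[int]]:
    """Merge the two most similar clusters until only k remain."""
    clusters = [[i] for i in range(len(documents))]
    centroids = [vectorize(d) for d in documents]  # summed word counts per cluster
    while len(clusters) > max(k, 1):
        best = (-1.0, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                similarity = cosine(centroids[i], centroids[j])
                if similarity > best[0]:
                    best = (similarity, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        centroids[i] = centroids[i] + centroids[j]
        del clusters[j], centroids[j]
    return [sorted(members) for members in clusters]


def suggest_name(documents: List[str], members: List[int], top_n: int = 2) -> str:
    """Propose a cluster label from its most frequent non-stop-words."""
    counts = Counter()
    for index in members:
        counts.update(vectorize(documents[index]))
    ranked = sorted(
        ((term, n) for term, n in counts.items() if term not in STOP_WORDS),
        key=lambda item: (-item[1], item[0]),  # by frequency, then alphabetically
    )
    return " ".join(term for term, _ in ranked[:top_n])


documents = [
    "council tax bill and council tax payment",
    "planning application for building extension",
    "overdue council tax payment reminder",
    "building planning consent application",
]
groups = cluster(documents, k=2)
names = [suggest_name(documents, members) for members in groups]
```

No exemplars are supplied: the two groups emerge from the documents' own vocabulary, and the suggested names are simply the commonest terms in each group.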
As neither methodology relies on existing databases or data models, and both can access data wherever it is stored within the intranet, the results are powerful: they draw on data which hitherto had not been readily or easily accessible.
- Accesses the whole dataset, not just particular departmental ‘silos’.
- Automatically achieves what is impossible in a manual exercise.
- Extracts the results into other applications such as SQL Server, SharePoint or Dynamics, or a document handling system.
- Creates communities of subject matter by automatically clustering content and resources into related groups.
Every system has documents which it doesn’t need and in some cases should not have. The advent of emails with attachments, the lack of any security or protective marking on many early documents, and the simple passage of time all create orphan data as creators or owners leave or move, or as departments merge. This means that huge numbers of documents are in effect abandoned. Straw polls suggest that most documents are never touched again once two and a half years have passed since their creation.
Add to the list the personal data which is stored at work, the work data which is stored outside of formal systems, and retention policies that are only executed on current data, and it all adds up to a potential time bomb about which, until now, very little could be done.
DataCube can identify documents by their content and place them in the appropriate categories according to the organisation’s schema. This is supplemented by DataCube’s Data Inventory module, which identifies where each document is stored and its details. So if an early confidential document is discovered which relates to a topic that certainly shouldn’t be in general company storage, DataCube can identify it by its content; should it have been amended but still carry the same conceptual meaning, DataCube can still locate it and give notification of its presence.
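To make the idea of a data inventory concrete, the sketch below walks a folder tree and records where each file is stored along with a few basic details. The function name and the recorded fields are assumptions chosen for illustration; the actual Data Inventory module may capture different or additional details.

```python
import os
import tempfile
from datetime import datetime, timezone
from typing import Dict, List


def build_inventory(root: str) -> List[Dict[str, object]]:
    """Record path, extension, size and last-modified time for every file under root."""
    inventory = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            info = os.stat(path)
            inventory.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
            })
    return inventory


# Demonstration on a throwaway folder rather than a real network share.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "minutes.TXT"), "w", encoding="utf-8") as handle:
        handle.write("Committee minutes")
    inventory = build_inventory(root)
```

Because each entry keeps the full path, a document flagged by its content can be traced back to exactly where it sits in storage.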
Near duplicates, exact duplicates, retention-expired documents, documents selected by age, topic, trigger point or expired project, personal files and illegal files (such as music) are some examples of what DataCube can identify, helping to clean up records management as well as release storage.
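Three of these checks can be sketched with standard techniques: a content hash for exact duplicates, word-shingle overlap (Jaccard similarity) for near duplicates, and a date comparison for expired retention. The function names, the 0.5 threshold and the sample file names are assumptions for this sketch; DataCube's own detection methods are not described in this article.

```python
import hashlib
import re
from datetime import date
from typing import Dict, List, Set, Tuple


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Overlapping runs of `size` consecutive words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(docs: Dict[str, str], threshold: float = 0.5):
    """Return (exact_pairs, near_pairs) of document names."""
    exact: List[Tuple[str, str]] = []
    near: List[Tuple[str, str]] = []
    names = sorted(docs)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if content_hash(docs[first]) == content_hash(docs[second]):
                exact.append((first, second))
            elif jaccard(shingles(docs[first]), shingles(docs[second])) >= threshold:
                near.append((first, second))
    return exact, near


def retention_expired(created: date, retention_years: int, today: date) -> bool:
    """True once the retention period has fully elapsed."""
    try:
        expiry = created.replace(year=created.year + retention_years)
    except ValueError:  # created on 29 February, expiry year is not a leap year
        expiry = created.replace(year=created.year + retention_years, day=28)
    return today >= expiry


docs = {
    "a.txt": "The annual budget report for the finance committee is attached",
    "b.txt": "The annual budget report for the finance committee is attached",
    "c.txt": "The annual budget report for the finance committee is now attached",
    "d.txt": "Minutes of the planning meeting held in March",
}
exact, near = find_duplicates(docs)
expired = retention_expired(date(2015, 6, 1), 7, today=date(2023, 1, 1))
```

In this example `a.txt` and `b.txt` are exact duplicates, while `c.txt`, which differs by one inserted word, is picked up as a near duplicate of both. Comparing every pair is fine for a sketch but would need an indexing scheme to scale to millions of documents.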
Document Management System Tools: EDRMS
Creating the Taxonomies which reflect your organisation:
Migrating the unstructured data into the EDRMS and auditing the existing systems to ensure that documents are accurately placed in categories.
The use of document management systems is becoming commonplace within the IT world. However, very few people (6% according to the AIIM SharePoint survey) are completely happy with the results of their implementation. Everybody can see the huge benefits which an EDRMS can give an organisation, but not everybody sees the huge amount of work and discipline needed to ensure that the project succeeds.
The implementation of document handling and document security is fraught with issues such as the creation of a taxonomy and the correct placement of each document into the appropriate category. Unless successfully managed, these issues often leave the full potential of a system unrealised, and the system becomes a brake on, rather than an aid to, the organisation.
DataCube can ease those worries and help overcome the two major issues which are encountered during the introduction of an EDRMS:
1) How to define a workable taxonomy for the Document Management system
2) How to migrate the data into the new Document Management system when the taxonomy has been defined
We have discussed how DataCube can take unstructured data and, using its conceptual search techniques, sort all the documents into a schema using either the Supervised or the Unsupervised (Dynamic Clustering) technique. Once this has been achieved, the results of the Auto Categorisation (classification) are migrated into the corresponding categories within the EDRMS. This technique is both reliable and, most importantly, repeatable, giving consistent results and an accurately positioned data set within the system.
Whilst DataCube works to provide a categorisation system for unstructured data, it can also check the contents of document management systems such as SharePoint and confirm that the data has been correctly placed into the EDRMS. DataCube can analyse the metadata of a document as well as its content, and can provide reports and checks that the two are consistent with each other within the defined taxonomy.
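An audit of this kind amounts to comparing the category recorded in a document's metadata with the category its content suggests. The sketch below uses a deliberately crude keyword classifier as a placeholder for content analysis; the record fields, paths, keyword lists and function names are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Set


def keyword_classifier(keywords: Dict[str, Set[str]]) -> Callable[[str], str]:
    """Build a placeholder content classifier that counts keyword hits per category."""
    def predict(text: str) -> str:
        words = set(re.findall(r"[a-z]+", text.lower()))
        scores = {category: len(words & terms) for category, terms in keywords.items()}
        best = max(sorted(scores), key=lambda category: scores[category])
        return best if scores[best] > 0 else "Unclassified"
    return predict


def audit(records: List[Dict[str, str]], predict: Callable[[str], str]) -> List[Dict[str, str]]:
    """Report every record whose filed category disagrees with its content."""
    report = []
    for record in records:
        predicted = predict(record["content"])
        if predicted != record["filed_category"]:
            report.append({
                "path": record["path"],
                "filed_category": record["filed_category"],
                "predicted_category": predicted,
            })
    return report


predict = keyword_classifier({
    "Finance": {"invoice", "budget", "payment"},
    "Planning": {"planning", "building", "application"},
})
records = [  # hypothetical paths and metadata
    {"path": "/sites/finance/inv-001.docx", "filed_category": "Finance",
     "content": "Invoice for payment of consultancy fees"},
    {"path": "/sites/finance/ext-plan.docx", "filed_category": "Finance",
     "content": "Planning application for a building extension"},
]
mismatches = audit(records, predict)
```

Only the second record is reported: it is filed under Finance, but its content points to Planning. Passing the classifier in as a function keeps the audit independent of how categories are predicted.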
Where data continues to be saved in an unstructured manner, DataCube can continue to be used to migrate the documents into the EDRMS but also to name and shame users who continue to behave in this way, leading to appropriate enforcement of the policy.
There is a growing view that, with the volume of unstructured data now being generated, existing document management systems, and EDRMS more widely, will not cope or will be swamped with data to the point of being unusable. In that scenario, a tool like DataCube will be required to help create order out of potential chaos, and can help preserve the usefulness and future-proofing of the EDRMS.