Big Data...turning a liability into an asset.

Taxonomy creation and Auto Classification

The organisational or categorical structure, often referred to as a taxonomy, needs to be well defined so that the user or automated technology knows the characteristics a document must have to be classified accordingly. This task, often seen as thankless, is fast becoming almost impossible as the volume of incoming data overtakes internal efforts to organise and classify it.

The DataCube Taxonomy creation and Auto Classification Service is split into two sections, each reflecting the methodology used to achieve the schema creation. 

The Supervised Categorisation Service takes the organisation's existing taxonomy, builds a Schema Management system to identify exemplars (i.e. the best examples of documents that would fit into each category), and then indexes and categorises the data so that each item is marked with the attributes of the document or record type for its category. The DataCube already has some schema models. For example, if the client were a UK Local Authority, the Service would use a version of the Local Government Classification Scheme (LGCS) as the basis of the Schema Management model for that organisation. Some bespoke categories may be added, but most categories are already populated with exemplars, so the auto categorisation exercise can be undertaken with minimal build time.

DataCube provides a Schema Management module as the basis of the automated system. Each category is defined and its attributes and properties are described, including default properties for the category, e.g. Retention Period. Anonymised exemplars are then associated with the category, providing the best and worst examples of documents that fit into it. Personalised schemas can be created to best fit the organisation’s needs.
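To make the structure described above concrete, the following is a minimal sketch of how a schema of categories, default properties and exemplars could be modelled. It is an illustration only, not the DataCube Schema Management module: the class names (Schema, Category, Exemplar), the property key retention_period_years and the example values are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Exemplar:
    """An anonymised example document attached to a category (hypothetical model)."""
    text: str
    is_positive: bool = True  # True = good example, False = counter-example


@dataclass
class Category:
    """One category in the schema, with its default properties and exemplars."""
    name: str
    description: str = ""
    default_properties: dict = field(default_factory=dict)
    exemplars: list = field(default_factory=list)

    def add_exemplar(self, text, is_positive=True):
        self.exemplars.append(Exemplar(text, is_positive))

    def positive_exemplars(self):
        return [e for e in self.exemplars if e.is_positive]


@dataclass
class Schema:
    """A named collection of categories, looked up by category name."""
    name: str
    categories: dict = field(default_factory=dict)

    def add_category(self, category):
        self.categories[category.name] = category

    def properties_for(self, category_name):
        # Return a copy so callers cannot alter the category defaults by accident.
        return dict(self.categories[category_name].default_properties)


# Illustrative usage with made-up values.
schema = Schema("Example scheme")
category = Category(
    "Example category",
    description="Placeholder category for illustration",
    default_properties={"retention_period_years": 6},  # assumed value
)
category.add_exemplar("Text of a document that fits this category well.")
category.add_exemplar("Text of a document that does not belong here.", is_positive=False)
schema.add_category(category)
```

A document marked against a category would then inherit that category's defaults, for example via schema.properties_for("Example category").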

Often, industry-standard schemas are used. For example, DataCube has used the Local Government Classification Scheme for UK Local Authorities, whose 750+ categories cover all activities within their business. We have defined this taxonomy in detail and populated it with over 15,000 anonymised exemplars. An index and categorisation system is therefore ready to deploy into any UK Local Authority, and it can be modified and enhanced to reflect the differences between authorities.

The DataCube scans and identifies all of the files in the network and creates a data inventory of every relevant data file, recording details about each file including its file path. It then uses this information to collect text from every file and build an index. Finally, it compares the text of every document against all of the exemplars to find the best correlation, and marks the document with the category of the exemplar it matched. This process has been extensively tested on millions of Local Authority documents and found to be over 85% accurate; with further retraining of the system, it is expected to achieve over 90%. By comparison, manual labelling is believed to be less accurate (around 50%) and, given the volume of legacy data, impossible to carry out.
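The matching step described above can be sketched as follows. This is a minimal illustration that assumes TF-IDF weighting and cosine similarity as the measure of "best correlation"; the source does not say which measure the DataCube uses, and the function names (tokenize, tfidf_vectors, cosine, classify) and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (word -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append({
            word: count * (math.log((1 + n) / (1 + doc_freq[word])) + 1)
            for word, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(documents, exemplars):
    """Mark each document with the category of its best-matching exemplar.

    documents: dict mapping file path -> extracted text (the inventory + index)
    exemplars: list of (category_name, exemplar_text) pairs
    Returns a dict mapping file path -> (category_name, similarity_score).
    """
    paths = list(documents)
    texts = [documents[p] for p in paths] + [text for _, text in exemplars]
    vectors = tfidf_vectors(texts)
    doc_vectors, exemplar_vectors = vectors[:len(paths)], vectors[len(paths):]
    result = {}
    for path, doc_vector in zip(paths, doc_vectors):
        scores = [cosine(doc_vector, ev) for ev in exemplar_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        result[path] = (exemplars[best][0], scores[best])
    return result


# Illustrative data only; paths, categories and texts are made up.
documents = {
    "/share/a.txt": "Application for planning permission to build an extension",
    "/share/b.txt": "Your council tax bill for band D is due for payment",
}
exemplars = [
    ("Planning", "planning application for building extension permission"),
    ("Council Tax", "council tax bill payment band"),
]
labels = classify(documents, exemplars)
```

In practice a similarity threshold would probably be needed so that documents matching no exemplar well are flagged for review rather than forced into the nearest category.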

The Unsupervised Categorisation Service is ideal for large datasets whose content is largely unknown, so that the appropriate categories for the dataset are not apparent. In this mode, the DataCube scans all of the organisation’s data and groups the documents into a specified number of clusters of documents that are conceptually related to each other. This process (known as Dynamic Clustering) groups documents by their conceptual similarity to each other without using example documents to establish the conceptual basis for each cluster. This is very useful when dealing with an unknown collection of unstructured text. The DataCube can group the documents into as many categories as the user specifies and even suggests names for them. In this way, the taxonomy reflects the actual data rather than a proposed taxonomy, whose categories or topics are often unevenly distributed across the data.
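As an illustration of grouping documents into a user-specified number of clusters without exemplars, the sketch below uses a simple k-means-style procedure over normalised word-count vectors, and suggests a cluster name from the most heavily weighted words. This is a stand-in technique chosen for the example; the source does not describe how Dynamic Clustering works internally, and the function names (vectorize, cluster, suggest_name) and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Unit-length word-count vector (word -> weight) for one document."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def similarity(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def centroid(vectors):
    """Normalised sum of the member vectors of a cluster."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {w: v / norm for w, v in total.items()} if norm else {}


def cluster(texts, k, iterations=10):
    """Group texts into k clusters; requires k <= len(texts).

    Returns (labels, centres), where labels[i] is the cluster of texts[i].
    """
    vectors = [vectorize(t) for t in texts]
    # Deterministic seeding: start with the first document, then repeatedly
    # add the document least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        candidates = (i for i in range(len(vectors)) if i not in seeds)
        seeds.append(min(
            candidates,
            key=lambda i: max(similarity(vectors[i], vectors[s]) for s in seeds),
        ))
    centres = [vectors[s] for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centre, then recompute centres.
        labels = [max(range(k), key=lambda c: similarity(v, centres[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centres[c] = centroid(members)
    return labels, centres


def suggest_name(centre, top=2):
    """Suggest a cluster name from its most heavily weighted words."""
    ranked = sorted(centre.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(word for word, _ in ranked[:top])


# Illustrative data only.
texts = [
    "planning application building extension",
    "planning permission building appeal",
    "council tax bill payment",
    "council tax band payment reminder",
]
labels, centres = cluster(texts, k=2)
names = [suggest_name(c) for c in centres]
```

A real deployment would also need stop-word removal and some way of choosing a sensible number of clusters; both are left out here for brevity.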

Because neither methodology relies on existing databases or data models, and both can access data wherever it is stored within the intranet, the results are powerful: they draw on data that had not previously been readily accessible.

The Taxonomy/Schema Creation Service and the subsequent Auto Classification enable many activities that were previously impossible, for example access to the whole dataset rather than just particular ‘silos’.

The system does not change your original data or its location (except by placing a metadata tag on it; even the last accessed date stays the same), and the work is carried out on your premises.

© Copyright 2013 Apperception