Big Data...turning a liability into an asset.

Legacy Data Management

Data Cleansing

Companies and organisations are proficient at securing their data perimeters, but are often less good at sorting out the data already held within their systems. Multiple departmental changes, takeovers, staff moving or retiring, old projects finishing, and discredited data storage and document management systems have all left many datasets virtually uncontrollable. DataCube has created a solution set to resolve these issues.

DataCube needs to collate information about the dataset it is working on. To achieve this it builds a Data Inventory in which it stores all the pertinent details of the data: its locations, file paths, metadata, protection levels, retention details and any other required fields. This inventory then allows the data to be analysed and many of its characteristics to be recognised. It identifies duplicates, email threads that contain multiple copies of the original emails, document dates, and authors who no longer have any connection with the organisation. It can also recognise subject matters and their dates, so that each department’s ‘cut off’ points can be identified and the documents beyond them deleted or archived. The most immediate gain is the identification of ‘date expired’ documents, which can legally be removed from storage, both freeing up disk space and reducing ‘discoverability’ under FoI requests and other enquiries.
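As an illustration only, a Data Inventory entry and two of the checks described above might look like the following sketch. The field names, the `InventoryEntry` class and both helper functions are assumptions made for this example; they are not DataCube's actual schema or interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class InventoryEntry:
    """One row of a hypothetical Data Inventory (field names are assumptions)."""
    path: str
    author: str
    modified: date
    protection_level: str
    retain_until: Optional[date] = None  # None: no retention date recorded


def orphaned_entries(entries, current_staff):
    """Return entries whose author is not in the current staff list."""
    return [e for e in entries if e.author not in current_staff]


def expired_entries(entries, today):
    """Return entries whose recorded retention date has already passed."""
    return [
        e for e in entries
        if e.retain_until is not None and e.retain_until < today
    ]
```

Entries returned by `expired_entries` would be candidates for removal, and those returned by `orphaned_entries` candidates for review by a new owner.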

The DataCube system will analyse all data and produce an array of reports detailing existing data, including:

  • File numbers by type and location, files by size, MIME groups and duplicates
  • MD5 analysis of each document’s content to establish exact duplicates, irrespective of file label or metadata
  • Analysis of metadata and activity history: created, modified, author, originating source
  • Data cleansing: outdated and orphaned data
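The exact-duplicate check in the list above can be sketched with Python's standard `hashlib` module. This is a minimal illustration, not DataCube's implementation: `find_exact_duplicates` is a hypothetical helper, and it hashes raw bytes, so it ignores the file name but not any metadata embedded inside the file. Ignoring embedded metadata would require extracting the document text before hashing.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(documents):
    """Group document names by the MD5 digest of their content.

    `documents` maps a name (or path) to the raw bytes of the file.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for name, content in documents.items():
        digest = hashlib.md5(content).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

MD5 is adequate for spotting accidental duplicates, but it is not collision-resistant against deliberate tampering; a digest such as SHA-256 could be substituted without changing the structure.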


Most companies have legacy or historic data comprising many millions of documents, most of which are stored in ‘local’ storage places, i.e. not in a document handling system. Often this data is subject to ‘personalised filing methods’ and, when staff leave the organisation, is most likely orphaned, with no one knowing what is there or how confidential it may be.

Over 20% of all data held is deemed to be out of date, redundant, orphaned or not owned by anyone, and much of it is held in breach of company policies.

Assessing the metadata or the titles is insufficient to capture the protective marking level which ought to be assigned to each individual document. The content of each document ought to be analysed to establish its place within the organisation’s taxonomy and its protective marking (today: Confidential, Restricted, Protect and Unclassified). It is well established that the scale of this task makes it impossible to undertake manually.

DataCube achieves retro-labelling by running the organisation’s personalised schema (complete with the document protective levels denoted within the Inventory) against the entire legacy data set. Using conceptual searching methods (Latent Semantic Indexing), it automatically sorts the data into categories which reflect that schema, and then adds the appropriate protective labelling both visually and in the metadata of each document. The benefits include:

  • Tighter control over data: less likelihood of breaches and Data Commission fines
  • Speedy results compared to alternative methods: a highly efficient and economic way of updating the security status of legacy data
  • Freeing up of duplicated disk space: identification of ‘dead data’ and time-expired data, which can be confidently removed
  • Enables the organisation to implement protective labelling systems knowing that the hitherto unreachable problem of retro-labelling the legacy data has been dealt with
  • Avoids costly mistakes: reputational damage, harm to careers, fines and doubts about the overall efficiency of the organisation, for which the SIRO or CEO is personally responsible
  • Automated support for people and processes enhances staff involvement and ownership of data security
  • Compliant with Governmental guidelines and policies
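The retro-labelling step described before this list, sorting each document into a schema category by its content and attaching that category's protective marking, can be illustrated with a deliberately simplified sketch. It uses plain bag-of-words cosine similarity as a stand-in for Latent Semantic Indexing, which would additionally reduce the term space with a singular value decomposition. The schema layout and the function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def label_document(text, schema):
    """Return (category, protective_marking) for the best-matching entry.

    `schema` maps category -> (example text, protective marking);
    this layout is hypothetical.
    """
    doc = term_vector(text)
    best = max(schema, key=lambda cat: cosine(doc, term_vector(schema[cat][0])))
    return best, schema[best][1]
```

A production categoriser would also need a minimum-similarity threshold, so that documents matching no category well are sent for manual review rather than labelled by default.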

Data Retention Policy Management

Many organisations have stored millions of documents but have no method of analysing them to establish either what the subject matter is or when the removal or retention policy ought to be applied. To add to the difficulty, retention dates are multi-faceted, with different triggers informing the retention date. An example is ‘a maintenance contract which the retention policy states should be kept for 5 years after the contract is completed and finished.’ Some documents can be readily removed, e.g. ‘12 months after the financial year end for 2009,’ but the officer responsible has to be sure that there is no other information within the documents which would trigger another retention condition. Without knowledge of the content, decisions may be flawed.
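The interaction of several retention triggers can be made concrete with a short worked sketch. It assumes each rule is expressed as "retain for N months after trigger X" and that a document may only be removed once every applicable rule has expired. The function names, rule layout and trigger names are hypothetical.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping to the last day of the month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month = total // 12, total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def earliest_removal(trigger_dates, rules):
    """Return the earliest date on which every retention rule has expired.

    `rules` maps trigger name -> months to retain after that trigger.
    `trigger_dates` maps trigger name -> date the trigger occurred.
    Returns None when no date can be determined, i.e. when there are no
    rules or when a required trigger has not happened yet.
    """
    expiries = []
    for trigger, months in rules.items():
        if trigger not in trigger_dates:
            return None
        expiries.append(add_months(trigger_dates[trigger], months))
    return max(expiries) if expiries else None
```

For instance, with a 12-month rule on a financial year end of 31 March 2010 and a 60-month rule on a contract completed on 30 June 2008 (both dates invented for illustration), the function returns 30 June 2013: the contract rule overrides the earlier financial-year date.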

DataCube uses its categorisation capability to sort all documents into the appropriate subject-based category. The document retention details for each category are recorded within the Data Inventory. Once the documents have been categorised, DataCube provides detailed reports on the documents which are due for deletion or review. Those which are due for deletion may be automatically deleted from the interface. Those which are due for review are presented to the operator, who can choose to ignore or delete each document and can reset its earliest review date.

Custom plug-ins can be defined to further scrutinise documents in classes whose policy depends on something external to the document itself. They analyse pre-defined document types (e.g. employment contracts, case reviews) which follow a format (or collection of formats), extracting and normalising essential trigger information such as a contract termination date or a date of birth.

  • Aged file analysis by content, subject matter and category, enabling real understanding of a document’s provenance as well as the restrictions and rules which should apply to it. (Applying the schema allows rules to be applied to documents based on their content rather than their title or a few keywords.)
  • Companies need to be able to access their legacy data and reassure themselves that they have taken the correct steps to ensure that the retention policy and the detailed instructions on handling the organisation’s data have been adhered to.
  • Triggers held in the Inventory enable complex ‘out of document’ retention information to be utilised.
  • Legal position is protected by ensuring that both the deletion and the retention of documents are in line with legal and policy requirements.
  • Can address millions of unstructured documents.

Data Protection Assurance

Examples of data that needs to be protected are not difficult to identify. One of the Local Authorities that we dealt with was motivated to implement a data loss prevention (DLP) solution when case notes about a child abuse case were leaked to the press by one of its own staff. Sensitive data can come in many forms, such as financial information, intellectual property (IP), customer or personal data, credit-card data, and other information, depending on the nature of the organisation.

“Big Data theory is moving faster than the reality of what an enterprise is capable of from both a technology and manpower standpoint.”
– IT Business Edge

DataCube provides a comprehensive data security solution that discovers, monitors, protects and manages information wherever it is stored or used. Once the DataCube solution has been implemented and populated, it provides a platform for enforcing data loss policies and for handling incident remediation and risk reporting from a single management console. DataCube protects confidential data by automatically enforcing data loss policies. The system interrogates the content of the data being sent, whether as email, printed copy or saved files, and assesses whether the action conforms to the security policy. Rather than simply checking for a text string, e.g. a credit card or social security number, DataCube analyses the content itself and compares it to the categorisation schema and the attributes of the category that the data belongs to. Users can be prompted or questioned to ensure that they are taking the correct action, or the action can be referred to a security gatekeeper.
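Once a document's category is known, the policy decision described above reduces to a lookup from category and channel to an outcome. The sketch below is purely illustrative: the policy table, category names, channel names and the `check_action` function are assumptions, and the categorisation itself is taken as already done.

```python
# Hypothetical policy table: category -> channel -> decision.
POLICY = {
    "social care case notes": {"email": "refer", "print": "prompt", "save": "allow"},
    "public notices": {"email": "allow", "print": "allow", "save": "allow"},
}


def check_action(category, channel, policy=POLICY):
    """Decide what happens when content of `category` is sent via `channel`.

    Returns 'allow', 'prompt' (ask the user to confirm) or 'refer'
    (pass the action to a security gatekeeper). Unknown categories and
    channels are referred, so the sketch fails safe.
    """
    return policy.get(category, {}).get(channel, "refer")
```

Defaulting to `refer` for anything not explicitly listed is a design choice in this sketch, not a statement about how DataCube behaves.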

The DLP solution is complemented by an online user education programme about data security, including how to secure exposed data and stop data leaks.

  • In-use or endpoint actions, i.e. preventing breaches by monitoring, detecting and blocking sensitive data at the end user’s device when data is received or sent
© Copyright 2013 Apperception