Know Your Data
Updated: May 30, 2019
Many organisations don’t realise it, but data has become the main differentiator in today’s market. Most companies are not prepared for this shift and risk not only being disrupted but also being left behind by their competitors, often simply because they don’t know their data or cannot make effective use of it.
Data is stored in many places in an organisation, often in business applications that are fragmented across various business units. These systems must remain highly available and performant, and they rarely interoperate, so the data they hold cannot readily be accessed for analytical purposes.
Making transactional data available for analytics has traditionally required costly ETL (Extract, Transform, Load) processes that push the data into a centralised data warehouse or data mart. To avoid the high cost and delays of formalised ETL, many business users took matters into their own hands and started analysing data with local tools such as spreadsheets, which over time grew into small local databases.
With the advent of Data Lakes, the issue of data storage across the organisation is becoming even more complex, because it is now easier to push raw data into the lake without controlling the source, usage, lineage and, most importantly, the business value of the data stored.
This landscape creates massive issues in Data Governance and Compliance with privacy laws. If the organisation does not have a clear picture of what data it stores, it will be impossible to meet compliance requirements. Heavy fines are a real risk as governments globally take a hard line on privacy and reporting breaches.
The other issue organisations face today is Cyber Security. In our highly networked technology landscape, hacking attempts are becoming commonplace. Protecting data that is stored in this fragmented fashion is extremely challenging, even more so when the organisation has no documentation of what is stored and where.
For many years, organisations have attempted to clean up and centralise data by consolidating small data stores into formalised and managed databases, only to find that over time new pockets of data spring up again across the different business units.
Crowd Sourcing your Data
The reason for this ever-repeating cycle is that business units require a quick turnaround to answer relevant business questions, often in real time. The work frequently starts with use cases too small for a formalised funding request, or IT is deemed too slow and inflexible to provide quick results.
The more IT and Data Governance teams embrace the natural way people use data in an organisation, the higher the value proposition for a formalised approach that meets governance and compliance requirements.
The changes required aren’t as radical or costly as many CFOs and CTOs would think. Firstly, the organisation needs to know what data it stores, where it is stored (including copies of the data) and where it originally came from.
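As a rough illustration of that first step, the three facts named above (what is stored, where it and its copies live, and where it came from) can be captured in a very small inventory record. This is only a sketch: the class name, field names and sample values are hypothetical and do not come from any particular catalogue product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One entry in a hypothetical data inventory."""
    name: str       # what the data is
    location: str   # where the primary copy is stored
    origin: str     # the system the data originally came from
    copies: list[str] = field(default_factory=list)  # known duplicate locations


def locations_of(record: DatasetRecord) -> list[str]:
    """Return every place the dataset lives: the primary location plus copies."""
    return [record.location, *record.copies]


# Illustrative values only; none of these systems are named in the article.
inventory = [
    DatasetRecord(
        name="customer_contacts",
        location="crm_db.contacts",
        origin="CRM application",
        copies=["sales_share/contacts.xlsx", "data_lake/raw/crm/contacts"],
    ),
]

for record in inventory:
    print(record.name, "from", record.origin, "->", locations_of(record))
```

Even a list this simple makes it possible to answer where a given dataset is held, which is the starting point for the compliance and security questions raised earlier.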
Roadmap into the Data Democracy
The driving principle is that people can store and manage their own data, as long as it is stored securely and is accessible to centrally managed meta data stores. Secondly, a centralised Information Catalogue that can discover new data sources automatically will provide a framework for governance across the organisation.
Democratising data in this way does not mean that we move away from formally managed and controlled data stores and centrally managed data warehouses and data lakes. These concepts have worked well and supplement the need for business and customers to access data quickly and efficiently. The main aim in using the catalogue is to get a view of the data landscape from the bottom up.
Central Data Governance – Decentralised Data
There are a number of different information catalogues available that manage meta data and provide a view of what information is available. Most catalogues were built as an extension to ETL tools or self-serve applications. Each of the tools can meet some of the requirements, but in our research the most comprehensive support is only provided by purpose-built catalogues such as Smart Catalogue from Waterline Data.
Waterline Data was quick to identify the need for a clear view of data within Data Lakes that embraces a bottom-up approach to organising meta data. Data in a large range of storage technologies across the organisation is discovered and profiled with the help of Artificial Intelligence. The automatically discovered data sources are then pre-tagged and presented to Data Stewards for review and acceptance.
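The discover, pre-tag and review flow can be pictured with a small sketch. It is not Waterline Data's API or algorithm: a simple regular-expression check stands in for the AI-based profiling, and every name in it (`TagSuggestion`, `suggest_tags`, `review`, the `email_address` tag, the 0.8 threshold) is an assumption made for illustration.

```python
import re
from dataclasses import dataclass

# Deliberately simple pattern; a stand-in for real profiling logic.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class TagSuggestion:
    column: str
    tag: str
    confidence: float         # share of sampled values that matched
    status: str = "pending"   # becomes "accepted" or "rejected" after review


def suggest_tags(column: str, sample: list[str], threshold: float = 0.8) -> list[TagSuggestion]:
    """Profile a sample of values and propose a tag if enough of them match."""
    if not sample:
        return []
    share = sum(1 for value in sample if EMAIL.match(value)) / len(sample)
    if share >= threshold:
        return [TagSuggestion(column, "email_address", share)]
    return []


def review(suggestion: TagSuggestion, accept: bool) -> TagSuggestion:
    """Record a Data Steward's decision on a pre-tagged suggestion."""
    suggestion.status = "accepted" if accept else "rejected"
    return suggestion


sample = ["a@example.com", "b@example.org", "n/a", "c@example.net", "d@example.com"]
for suggestion in suggest_tags("contact", sample):
    print(review(suggestion, accept=True))
```

The point of the design is that the machine only proposes: nothing counts as catalogued until a steward has accepted or rejected the suggestion.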
This allows organisations to decentralise and democratise data classifications to the business units. Meta data is tagged with business terms understood in the business unit. For more formal semantic models, a second tagging layer can be created that bridges the different business units.
The benefits of this approach are considerable: for the first time, a large percentage of data can be catalogued and made accessible to analysts across the organisation quickly. Not only can the organisation meet compliance requirements, but it also becomes possible to identify opportunities that will transform the business.
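A minimal sketch of the two tagging layers, assuming plain dictionaries rather than any real catalogue's data model: the first layer holds each business unit's own vocabulary, and the second maps those local terms onto a shared term. All unit names, sources and terms here are invented for illustration.

```python
# Layer 1: tags in each business unit's own vocabulary (hypothetical examples).
local_tags = {
    ("sales", "crm_db.contacts"): "client",
    ("finance", "ledger.debtors"): "debtor",
    ("support", "tickets.requesters"): "end user",
}

# Layer 2: a shared semantic term that bridges the local vocabularies.
shared_terms = {
    "client": "Customer",
    "debtor": "Customer",
    "end user": "Customer",
}


def sources_for(shared_term: str) -> list[str]:
    """List every data source whose local tag maps to the given shared term."""
    return sorted(
        source
        for (unit, source), tag in local_tags.items()
        if shared_terms.get(tag) == shared_term
    )


print(sources_for("Customer"))
```

Business units keep the terms they already use, while an analyst searching for the shared term still finds the relevant sources in all three units.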
Fusion Professionals’ Big Data and Analytics team specialises in modern analytics concepts and technologies. Our team can help your organisation identify its requirements and provide advice on the best tools and processes. Please feel free to contact our office for more information.
Achim Drescher is the Managing Consultant of the Big Data and Analytics Practice at Fusion Professionals.
With 30 years in the IT industry, he is an Expert in Enterprise Software and Data Architecture, Data Governance frameworks and modern analytics platforms for Big Data and Data Lakes.