To the extent that the trade press meaningfully covers the issues around digital information processing and technology, it tends to focus narrowly on the latter: infrastructure.  The latest hardware innovation — the fastest processor, the slickest server, the most robust hyper-converged infrastructure appliance — tends to be the shiny new thing, hogging the coverage.

What is really missing, however, is virtually any discussion of data itself.  Data is supposed to be what drives all of the hardware and software development.  The data that an application produces, its useful life, its historical significance, its legal value are all supposed to factor into how we build processors, networks and storage to process, distribute and store data.

Yet how data are managed and stored – the spatial challenge of writing bits to media, ensuring that they are free from corruption, properly protected, and distributed for the most efficient access – and how they are preserved until their usefulness wanes are seldom discussed.  It is treated as a given: data needs to be stored, accessed, then deleted once it has no value.  Coping with the challenges that arise from data management – whether they derive from the spatial or the temporal dimensions of digital information handling – tends not to get much attention unless you are a server, network, or storage administrator who must deal with end-user complaints about slow system response to file searches and retrievals, with governance and compliance officers upset by a lack of proper stewardship of data assets, or with internal auditors who worry that critical data is at risk of loss from natural or man-made catastrophes.

Data management, while central to just about everything we do in information technology, hasn’t gotten much attention since we changed the name on the data center door from “data processing” to “information technology.”  Around that time, our attention changed from hosting, protecting, preserving and keeping private data assets to racking, connecting, and maintaining the computer, network, and storage hardware.

But recently, attention has been returning to data itself.  A big driver for the reassertion of data as a focus of IT planning is simple: there is a lot of it.  Depending on the analyst one consults, we are looking at between 10 and 60 zettabytes of data presenting itself for processing, networking, and storage by 2020.  This number is mind-boggling, and it is also potentially devastating – first to the industrial farmers of the cloud (Amazon, Google, Microsoft, IBM), who will be expected to store the lion’s share of the bits, but also to private organizations and public scientific research facilities with large data centers.  How do you store that much data (a zettabyte equals 1,000 exabytes) when the combined annual manufacturing capacity of the disk and flash industries (the preferred storage media for most companies) works out to just under 1.8 zettabytes per year?
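A quick back-of-the-envelope calculation makes the mismatch concrete. The sketch below uses only the figures cited above (10–60 ZB of projected demand, roughly 1.8 ZB of combined annual disk and flash production); the variable names are illustrative, not from any standard:

```python
# Back-of-the-envelope: how many years of total disk + flash
# manufacturing output would be needed to store the projected data?
# Figures are those cited in the text, not independent measurements.

EXABYTES_PER_ZETTABYTE = 1000

annual_media_output_zb = 1.8        # combined disk + flash, ZB/year
projected_demand_zb = (10, 60)      # low and high analyst estimates

for demand in projected_demand_zb:
    years = demand / annual_media_output_zb
    print(f"{demand} ZB ({demand * EXABYTES_PER_ZETTABYTE} EB) "
          f"would consume {years:.1f} years of media production")
```

Even at the low end of the estimates, several years of the industry's entire output would be needed just to hold the data, before accounting for copies made for protection and replication.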

Another issue has to do with stewardship.  Regulatory and legal regimes have grown up in most industrialized countries requiring the protection, privacy, and preservation of certain types of data, sometimes specified with time frames spanning decades.  How do we provide the right services to the right data at the right time in their poorly defined useful lifespans?  It is already a challenge just to protect, preserve and keep private the data on your smart phone, tablet or laptop – and that is usually just a few terabytes of data.  Multiply the challenge and complexity by millions and you begin to grasp the zettabyte era.

So, practical issues abound in data handling, and data management – the discipline for managing data assets in time and space – has been ignored for so long that we may lack the ability to tackle those issues successfully.  Googling the term “data management” will likely take you down a rabbit hole to the world of big data analytics and data science, which are not the same thing as data management in the broader sense.  Or you might encounter some links to articles on hierarchical storage management (HSM) or information lifecycle management (ILM), practices that never really made the transition from the mainframe-centric computing world to the distributed computing world.

Needed is a fresh examination of data management and perhaps some guidance to the new generation of IT practitioners regarding best practices for wrangling all the bits.  The next four blogs in this series will set the stage for such an examination so that brighter minds than mine can come up with some solutions.

Part 1 will talk about the foundational concepts of data management, the drivers, and the expected benefits.  It will also introduce the ideas of capacity allocation efficiency and capacity utilization efficiency, explaining what they mean and how they shape our goals and objectives in data management.

Part 2 will introduce the components of a cognitive data management capability, the umbrella term for the combination of processes needed to manage data effectively.  Components include a data management policy framework, a storage resource management engine, a storage services management engine, and a cognitive computing facility that maps policies to resources and services over time in order to manage data.

Part 3 will dig into the specifics of storage resource and services management and Part 4 will examine how far we have come with cognitive computing and how it might be applied to that “internet of things” that is our data and our storage infrastructure.

Hopefully, this will start the thought processes moving for a lot of readers and stimulate the creative solution-building for what is likely to become the most compelling technical challenge we have ever faced.  Kudos to Starwind Software for providing the space for this discussion.
