Most companies today have invested in data science to some degree. In the majority of cases, data science projects have tended to spring up team by team inside an organization, resulting in a disjointed approach that isn’t scalable or cost-efficient.
Think of how data science is typically introduced into a company today: Usually, a line-of-business organization that wants to make more data-driven decisions hires a data scientist to create models for its specific needs. Seeing that group’s performance improvement, another business unit decides to hire a data scientist to create its own R or Python applications. Rinse and repeat, until every functional entity within the corporation has its own siloed data scientist or data science team.
What’s more, it’s very likely that no two data scientists or teams are using the same tools. Right now, the vast majority of data science tools and packages are open source, downloadable from forums and websites. And because innovation in the data science space is moving at light speed, even a new version of the same package can cause a previously high-performing model to suddenly — and without warning — make bad predictions.
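One way to catch this kind of silent drift is to compare the packages installed in an environment against the versions a model was validated with. The following is a minimal sketch, not a definitive implementation: the function names `find_version_drift` and `_installed_version` and the pinned version numbers are hypothetical, and only the standard library's `importlib.metadata` is assumed.

```python
from importlib import metadata


def _installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_version_drift(pinned, installed_version=None):
    """Compare pinned versions against what is actually installed.

    `pinned` maps package name -> version the model was validated with.
    `installed_version` is an optional lookup function (hypothetical hook,
    useful for testing); by default the live environment is inspected.
    Returns {package: (pinned_version, installed_version_or_None)} for
    every package whose installed version differs from the pin.
    """
    lookup = installed_version or _installed_version
    drift = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            drift[name] = (wanted, found)
    return drift


if __name__ == "__main__":
    # Illustrative pins only; real values would come from the team's records.
    pins = {"numpy": "1.22.4", "pandas": "1.4.2"}
    for package, (wanted, found) in find_version_drift(pins).items():
        print(f"{package}: validated with {wanted}, environment has {found}")
```

Running a check like this before scoring gives a team a chance to notice an upgraded package before it changes a model's predictions.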
The result is a virtual “Wild West” of multiple, disconnected data science projects across the corporation into which the IT organization has no visibility.
To fix this problem, companies need to put IT in charge of creating scalable, reusable data science environments.
In the current reality, each individual data science team pulls the data they need or want from the company’s data warehouse and then replicates and manipulates it for their own purposes. To support their compute needs, they create their own “shadow” IT infrastructure that’s completely separate from the corporate IT organization. Unfortunately, these shadow IT environments place critical artifacts — including deployed models — in local environments, shared servers, or in the public cloud, which can expose your company to significant risks, including lost work when key employees leave and an inability to reproduce work to meet audit or compliance requirements.
Let’s move on from the data itself to the tools data scientists use to cleanse and manipulate data and create these powerful predictive models. Data scientists have a wide range of mostly open source tools from which to choose, and they tend to do so freely. Every data scientist or group has their favorite language, tool, and process, and each data science group creates different models. It might seem inconsequential, but this lack of standardization means there is no repeatable path to production. When a data science team engages with the IT department to put its models into production, the IT folks must reinvent the wheel every time.
The model I’ve just described is neither tenable nor sustainable. Most of all, it’s not scalable, and scalability will be of paramount importance over the next decade, when organizations will have hundreds of data scientists and thousands of models that are constantly learning and improving.
IT has the opportunity to assume an important leadership role in creating a data science function that can scale. By leading the charge to make data science a corporate function rather than a departmental skill, the CIO can tame the “Wild West” and provide strong governance, standards guidance, repeatable processes, and reproducibility — all things at which IT is experienced.
When IT leads the charge, data scientists gain the freedom to experiment with new tools or algorithms but in a fully governed way, so their work can be raised to the level required across the organization. A smart centralization approach based on Kubernetes, Docker, and modern microservices, for example, not only brings significant savings to IT but also opens the floodgates on the value the data science teams can bring to bear. The magic of containers allows data scientists to work with their favorite tools and experiment without fear of breaking shared systems. IT can provide data scientists the flexibility they need while standardizing a few golden containers for use across a wider audience. This golden set can include GPUs and other specialized configurations that today’s data science teams crave.
A centrally managed, collaborative framework enables data scientists to work in a consistent, containerized manner so that models and their associated data can be tracked throughout their lifecycle, supporting compliance and audit requirements. Tracking data science assets, such as the underlying data, discussion threads, hardware tiers, software package versions, parameters, results, and the like helps reduce onboarding time for new data science team members. Tracking is also critical because, if or when a data scientist leaves the organization, the institutional knowledge often leaves with them. Bringing data science under the purview of IT provides the governance required to stave off this “brain drain” and make any model reproducible by anyone, at any time in the future.
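As a rough illustration of what tracking these assets can look like, the sketch below captures a fingerprint of the underlying data, software package versions, parameters, and results in a single record that could be stored alongside a model. It is an assumption-laden example rather than a description of any particular platform: the function name `build_run_record`, the field names, and the sample values are all hypothetical, and only the Python standard library is used.

```python
import hashlib
import json
import platform
from importlib import metadata


def build_run_record(params, results, data_bytes, packages):
    """Assemble a reproducibility record for one model run.

    params     -- dict of model parameters used for the run
    results    -- dict of metrics produced by the run
    data_bytes -- raw bytes of the training data, hashed for traceability
    packages   -- names of packages whose installed versions to record
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing the run.
            versions[name] = None
    return {
        "python_version": platform.python_version(),
        "package_versions": versions,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,
        "results": results,
    }


if __name__ == "__main__":
    # Sample values are made up purely for illustration.
    record = build_run_record(
        params={"learning_rate": 0.1, "max_depth": 4},
        results={"auc": 0.91},
        data_bytes=b"customer_id,churned\n1,0\n2,1\n",
        packages=["numpy", "pandas"],
    )
    print(json.dumps(record, indent=2, sort_keys=True))
```

Because the record is plain JSON, it can be kept in whatever central store IT already governs, which is what lets someone else reproduce the run after the original author has moved on.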
What’s more, IT can actually help accelerate data science research by standing up systems that enable data scientists to self-serve their own needs. While data scientists get easy access to the data and compute power they need, IT retains control and is able to track usage and allocate resources to the teams and projects that need them most. It’s really a win-win.
But first CIOs must take action. Right now, the impact of our COVID-era economy is necessitating the creation of new models to confront quickly changing operating realities. So the time is right for IT to take the helm and bring some order to such a volatile environment.
Nick Elprin is CEO of Domino Data Lab.