
Mountains of Data: Big vs Small and Wide


If you’re in the tech industry (and probably even if you’re not), you’ve been hearing a lot about AI. I’m not just talking about the “Skynet is taking over the earth” type of AI from science fiction that we’ve all enjoyed over the years, but the practical application of artificial intelligence and machine learning in our day-to-day lives.

The lifeblood and sustenance of AI/ML is big data. Huge data. Massive amounts of data. Or is it? Big Data has been the engine feeding today’s AI/ML, and while sheer volume may always have its place, in recent years organizations have started shifting from Big Data to Small and Wide data.

Let’s compare the two.

Heaps of Data 

Big Data can be broken down into two parts.

The first is gathering and organizing a large dataset—a simple concept that can be difficult to execute well. The process requires handling a high volume of quickly accumulating, typically unstructured data. The back-end infrastructure to accommodate this data stream is resource intensive, demanding the network bandwidth, storage space, and processing power to support massive database deployments. And it’s expensive.

The second part is trickier. Once you have a massive heap of data, you need to extract insight and value from it. Technologies have evolved to accommodate the size of big data, but there’s been less progress on determining what can actually be derived from these mountains of information.

This is when it’s time to get smarter. Even in environments with infinite storage space and the perfect NoSQL deployment, all the data in the world won’t mean anything if you don’t have the right models to match.

There’s an opportunity here as well. Companies are finding use cases where less data from more sources is more practical, and they’re drawing better conclusions and correlations from smaller, more varied datasets.

Small and Wide

With a small and wide approach, you’re looking at a greater variety of sources and searching for correlations, not just increasing the raw quantity of data. This more tactical approach requires less data and, in turn, fewer computing resources. Variety is the name of the game: going small and wide means looking for diverse data formats, structured and unstructured, and finding links between them, as in the sketch below.
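To make that idea concrete, here’s a minimal sketch of what “wide” can look like in practice. The datasets, column names, and use of pandas are all hypothetical choices for illustration: a handful of small tables from different sources are joined on a shared key, and the combined features are checked for correlations with the outcome of interest.

```python
import pandas as pd

# Three small datasets from different sources, joined on a shared store_id.
# All names and values here are hypothetical, purely for illustration.
sales = pd.DataFrame({
    "store_id": [1, 2, 3, 4],
    "weekly_sales": [12000, 9500, 14300, 8700],
})
weather = pd.DataFrame({
    "store_id": [1, 2, 3, 4],
    "avg_temp_c": [21.5, 18.0, 24.1, 16.3],
})
reviews = pd.DataFrame({
    "store_id": [1, 2, 3, 4],
    "avg_review_score": [4.2, 3.8, 4.6, 3.5],
})

# "Wide": combine a variety of sources rather than piling up more rows.
combined = sales.merge(weather, on="store_id").merge(reviews, on="store_id")

# Look for correlations across sources against the outcome we care about.
print(combined.drop(columns="store_id").corr()["weekly_sales"])
```

The point isn’t the specific libraries or columns; it’s that the signal comes from linking several small, diverse sources rather than from the sheer size of any one of them.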

According to a Gartner report in 2021: “Potential areas where small and wide data can be used are demand forecasting in retail, real-time behavioural and emotional intelligence in customer service applied to hyper-personalization, and customer experience improvement.”

There’s a lot of potential, but what does this look like in practice? Massive datasets can become unwieldy or outdated quickly. In the information age, human trends and behaviors can turn on a dime, swayed by cultural and economic shifts. There is room for more agile models built on smaller datasets that can dynamically adapt to these changes, as sketched below.
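One way to picture that agility is a rolling-window refit: instead of training once on a giant historical archive, a small model is refit on only the most recent data so that fresh shifts in behavior dominate its forecasts. This is a sketch under assumed, synthetic data; the window size, the use of scikit-learn, and the simple linear model are all illustrative choices, not a prescribed method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic daily demand whose underlying trend shifts partway through --
# the kind of behavior change a static model trained on old data would miss.
days = np.arange(200)
demand = np.where(days < 120, 100 + 0.5 * days, 160 - 0.8 * (days - 120))
demand = demand + rng.normal(0, 3, size=days.size)

WINDOW = 30  # only the most recent 30 days inform each forecast

# Refit a small model on the rolling window and forecast the next day.
for today in range(WINDOW, 200, 30):
    X = days[today - WINDOW:today].reshape(-1, 1)
    y = demand[today - WINDOW:today]
    model = LinearRegression().fit(X, y)
    next_day = model.predict(np.array([[today]]))[0]
    print(f"day {today}: next-day forecast = {next_day:.1f}")
```

Because each refit sees only a small, recent slice of data, the model tracks the trend reversal quickly, whereas a model fit once on the full history would keep projecting the old trend.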

A report from the Harvard Business Review explains that “many of the most valuable data sets in organizations are quite small: Think kilobytes or megabytes rather than exabytes. Because this data lacks the volume and velocity of big data, it’s often overlooked, languishing in PCs and functional databases and unconnected to enterprise-wide IT innovation initiatives.”

The report describes an experiment the authors conducted with medical coders, highlighting the human factors involved in training AI with small data. I recommend reading through the study, but the ultimate conclusion was that, in addition to small data, considering the human element can improve models and give organizations a competitive advantage in the big data arms race.

In other words, we’re talking about small, wide, and smart data as a winning combination.

Drawing Conclusions

What does all this mean? Many volumes could be, and have been, written on this subject, but let’s take a quick, holistic look for a take-home message. I like my PC powerful enough to double as a heating source for my home office, but there comes a point where “more” hits its limit. A poorly optimized piece of software will still run terribly, even on the highest-end workstation.

In many cases, throwing more resources at a problem is impractical and overlooks the real issues. More often, the bigger opportunity lies in working smarter with what you have, and that’s something we’re starting to see with big data today. There are still use cases where sheer volume of data is truly necessary, but it’s just as important to design models that get the best use out of the data you have, not just methods that amass the most of it.

