Posts Tagged ‘big data’

Small Data or Big Data – Which Matters Most for AI?

Monday, August 27th, 2018

By Jeff McDowell, COO at Primal

In the past year, we have seen countless headlines about how artificial intelligence (AI) will transform business. AI promises to provide insight into data and customers at a level of individualization never seen before. In response, many companies are scrambling to capture and store as much data as possible – but in doing so they might be increasing their exposure to data breaches, privacy violations, and hacks.

Unfortunately, by taking a standard “machine learning only” approach to AI, we may not get far out of the starting blocks toward an AI solution that can understand data at a high level of fidelity. Many people assume that storing and analyzing large amounts of information (“big data”) through machine learning is the only way to take advantage of AI. But machine learning approaches can actually be ineffective at understanding the meaning of text or the interests of individuals with any specificity. Any company serious about AI needs a solution that is both more targeted and more secure. I believe the way forward lies in integrating small data analysis into a big data approach.

Here are a few reasons to consider small data:

Big data techniques can be expensive and ineffective at high levels of specificity: Just as satellite imagery provides only a broad picture of a physical lake, today’s big data approaches give only a broad picture of data lakes. When statistical methods of AI are applied to a big data environment, the output is usually very generalized and lacks fidelity. For example, a statistical model looking at data about sports fans may see a pattern that groups people into categories such as “baseball enthusiast”, “football enthusiast”, and so on. These broad categories lose sight of the fact that some users are actually pitching enthusiasts, statistics junkies, or part-time umpires. Knowledge of these narrower topics would be extremely useful to advertisers of niche products, yet today’s big data platforms are very limited in identifying and exposing these higher-fidelity interest categories. This is because processing and storage become increasingly expensive and complex when large amounts of data are analyzed to achieve higher levels of specificity.
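To make the “broad buckets” problem concrete, here is a toy sketch (invented for illustration, not Primal’s system) of a frequency-based model that assigns each user to whichever coarse sport dominates their activity. Notice how the fine-grained interests are folded into their parent category and disappear:

```python
from collections import Counter

# Hypothetical activity logs; "pitching" and "umpiring" are the
# narrow interests the coarse model will erase.
ACTIVITY = {
    "alice": ["baseball", "pitching", "pitching", "baseball"],
    "bob":   ["football", "baseball", "football", "football"],
}

def broad_category(events):
    # Map every event to its coarse bucket, then pick the most common one.
    coarse = {"pitching": "baseball", "umpiring": "baseball"}
    buckets = [coarse.get(e, e) for e in events]
    return Counter(buckets).most_common(1)[0][0] + " enthusiast"

for user, events in ACTIVITY.items():
    print(user, "->", broad_category(events))
# alice -> baseball enthusiast  (her pitching interest is gone)
# bob   -> football enthusiast
```

Alice is labelled a generic “baseball enthusiast” even though most of her activity is specifically about pitching; that nuance is exactly what a niche advertiser would want and what the coarse model throws away.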

Integrating small data analysis is the key to making AI meaningful: Small data simply refers to the quantity of data available to train models. It’s often defined as the amount of information that can be processed by one computer, but it could be even smaller than that: a spreadsheet, a document, an article, or even a single social media post. “Small data” can even be found within large data sets. Instead of applying statistical collaborative filtering to a group of people to infer broad interests, with hit-and-miss results, an approach that applies semantic or symbolic techniques to small data can look at an individual and understand exactly what they are interested in, at any level of specificity. For our baseball example, a small data approach would analyze the meaning and context of a person’s blog or social media post, and pick up the nuance between someone who likes statistics and someone who is interested in pitching techniques.
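A minimal sketch of the symbolic idea, with an invented concept lexicon and invented posts (real systems use far richer knowledge representations): match a single post against hand-built concepts rather than clustering a population of users.

```python
# Hypothetical concept lexicon: each fine-grained interest is defined
# by a handful of indicator terms.
LEXICON = {
    "pitching techniques": {"curveball", "fastball", "grip"},
    "baseball statistics": {"era", "whip", "sabermetrics"},
}

def interests_from_post(post):
    words = set(post.lower().split())
    # A concept matches if any of its indicator terms appears in the post.
    return sorted(c for c, terms in LEXICON.items() if words & terms)

# Two fans, both "baseball enthusiasts" to a coarse model, but with
# clearly different fine-grained interests at the level of one post:
print(interests_from_post("Loving the sabermetrics debate on ERA vs WHIP"))
print(interests_from_post("Working on my curveball grip all weekend"))
```

Because each decision rests on one small piece of text and a named concept, the output is specific to the individual rather than averaged across a crowd.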

Small data approaches increase explainability and reduce potential for bias: One of the criticisms of AI is that it operates in a “black box”, where it can be difficult to determine the reasoning behind a specific output. Numerous organizations – including the National Institute of Standards and Technology (NIST) – have called for a more balanced and thoughtful approach to developing AI solutions, to ensure they are trustworthy and explainable. AI outputs based on small data are inherently easier for humans to interpret. AI systems that analyze and categorize users based on large data sets also risk introducing biases over time – a problem that can be mitigated by integrating analysis of small data, which can serve as a self-correction against bias.
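One reason symbolic small-data outputs are easier to interpret is that the evidence for a label can be returned alongside the label itself. A hypothetical sketch (the rule set is invented for illustration):

```python
# Invented rule set: each label is backed by explicit indicator terms.
RULES = {"pitching enthusiast": {"curveball", "bullpen", "mound"}}

def classify_with_evidence(text):
    words = set(text.lower().split())
    results = []
    for label, indicators in RULES.items():
        hits = sorted(words & indicators)
        if hits:
            # The matched terms *are* the explanation a human can audit.
            results.append((label, hits))
    return results

print(classify_with_evidence("great bullpen session on the mound today"))
```

There is no hidden weighting to reverse-engineer: a reviewer can see exactly which terms triggered the label, and correct the rules directly if they encode a bias.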

AI has huge potential to augment our human intelligence and make us more productive. The importance and power of small data for AI is still on the fringes of being understood, but will gain momentum as businesses and consumers increasingly expect a greater level of relevance and security from AI. Even Eric Schmidt, former CEO of Google, recently tweeted, “AI may usher in the era of ‘small data’ – smarter systems can learn with less to train on.”

The current model of statistical analysis of big data is ‘good enough’ for now, but not sustainable. For AI to really be relevant, efficient, and safe, big data must be balanced by robust small data processing.

Big data’s creative and empathetic sibling (and why they struggle to get along)

Wednesday, November 13th, 2013

Joyce Hostyn argues that Better Human Understanding, Not Big Data, Is the Future of Business. Some excerpts (with my emphasis):

Despite the best of intentions, we’re not data driven, we’re hypothesis driven. Our stories (our mental models) are merely hypotheses of how the world works. But we see them as reality and they influence what data we collect, how we collect it and the meaning we glean from it…

In a quest to become data driven, are marketers trapping themselves with outdated mental models of data and analytics? “Big data is being wasted on marketing. The true power of analytics is in revealing cultural dynamics.”

Many echo these concerns about data-driven marketing, and the need to be skeptical and hypothesis-driven. (1) (2) (3)

Hostyn concludes her thoughtful article with a number of questions, including:

  • Can we leverage big data to zoom out and understand patterns and trends, then zoom back in for a dive deep into the hearts and minds of individuals?
  • Are we willing to develop hypotheses with the potential to disrupt our old mental models? Create experiments to test those hypotheses. Prototype to think. Collect feedback. Iterate.

At Primal, we’ve invested years exploring this mode of hypothesis-setting as a lens into big data. It involves a collaboration between humans and machines across the full spectrum of analytical and synthetic thinking.

What follows is a summary of that exploration and what we’ve learned to this point.


Content Assistants for Messaging and Social Media

Wednesday, June 26th, 2013

Small conversational data such as tweets or text messages is a goldmine of individual interests. Every day, millions of people tweet about their favourite food, send a Kik or Facebook message about a recent TV episode, or take a photo on Instagram while attending a sporting event or concert.

However, if you’re trying to recommend content or promote offers within a conversation, determining a user’s interests from these small data inputs is a huge challenge.

John Koetsier writes in VentureBeat:

…tweets are difficult to register commercial intent…

If I tweet about my wife’s illness, are you going to target me with a random medicine? Or if you tweet about a great dinner you’re just about to eat, will you really be receptive to ads about a Greek restaurant just down the road?

In this post, we’ll show you an elegant solution for content recommendations, applied in messaging and social media platforms.

The Myth Behind Big Data and Privacy

Tuesday, June 11th, 2013

We know what personalization means and the compromises it imposes on our individual privacy.

Or do we?

This is perhaps the most insidious myth among the technorati: In order for people to benefit from advanced and personalized technologies, they need to compromise their individual privacy.

This idea is remarkably pervasive and damaging, driving both consumers and businesses away from the opportunities of personalization and next-generation information services.

In this post, I’m going to introduce you to the myth and the underlying villain, Big Data. I’m also going to argue that innovation is a much better path forward than evil, or doing nothing at all.

Why The "Star Trek Computer" Needs Open Data…And Scotty, Too

Friday, May 24th, 2013

Do you share this opinion?

So given that, what can we say about the eventual development of something we can call “The Star Trek Computer”? Right now, I’d say that we can say at least two things: It will be Open Source, and licensed under the Apache Software License v2. There’s a good chance it will also be a project hosted by the Apache Software Foundation.

The rationale? ASF provides an awesome array of advanced technologies, in everything ranging from NLP, information extraction and retrieval, machine learning, Semantic Web, and on and on. It’s like a free, all-you-can-eat buffet! (er, Star Trek food synthesizer?)

I share the enthusiasm for open source tools. We use many of these technologies at Primal.

But this is where their science fiction story starts to lose me:

Of course, you don’t necessarily need a full-fledged “Star Trek Computer” to derive value from these technologies. You can begin utilizing Semantic Web tech, Natural Language Processing, scalable machine Learning, and other advanced computing techniques to derive business value today.

We often meet product developers and entrepreneurs looking to build next-generation intelligent solutions.

If these advanced technologies are available for free, why not just jump in and start building?

A Reality Check on Big Data

Saturday, March 3rd, 2012

The fervor around big data continues to grow. The World Economic Forum and The New York Times are jumping on the bandwagon. While we share their enthusiasm for the potential, big data needs a reality check.

Here are just a few of the how-do-you-get-there-from-here questions for anyone considering big data projects.

Where Big Data Fails…and Why

Friday, October 14th, 2011

Data-driven technologies are plagued with small data problems. Their performance suffers in markets that aggregate a large number of unique interests. Some of the largest markets share these small data characteristics, including local ecommerce, personalized media, and interest networking. New approaches are needed that are far less sensitive to the cost and complexity of the data.

Read the full post on Medium