There are several definitions of Big Data. Half a century after computers entered mainstream society, data has accumulated to the point where something new and special is taking place.

With vast amounts of data available now, companies in almost every industry are focused on exploiting data for a competitive advantage.

The number of internet sites has been estimated at more than 1 billion, with yearly growth of 5.1% (an estimate by Netcraft).

The volume and variety of data have far outstripped the capacity of manual analysis, and even the capacity of a single computer.

The world is awash with information, and it is growing faster and faster. Hence the term “Big Data” was born.

Big data refers not only to a high volume of data but also to new processing technologies, like data science and data mining. At a high level, “data science” is defined as a set of fundamental principles that guide the extraction of knowledge from data. “Data mining” is defined as the extraction of knowledge from data, via technologies that incorporate these principles.


Regarding data mining, experts say this is just the start. The era of data mining challenges the way we interact with the world: in the past, most of the time we looked for causality in order to get an explanation. This is changing in favour of looking for correlations in the data, and it is these correlations that allow us to really extract “information” from it.

A generally accepted definition of “information” is: a quantity that reduces uncertainty about something.

In the specialized literature, the fundamental concept of data mining is defined as follows: “extracting useful knowledge from data to solve business problems by following a process with well-defined stages”.

One of the well-known areas where data mining is used is customer relationship management, which aims to analyze customer behavior in order to maximize expected customer value.

A very well-known example is “Predictive customer churn”. Customers switching from one company to another is called “churn”.

Churn is quite expensive for companies: they spend on incentives to attract customers and lose revenue when a customer departs. Preventing churn has therefore become a key strategic component for many companies in the world.

Each data-driven business decision-making problem is unique, comprising its own combination of goals, desires and constraints.

Consider the following examples of two quite different business problems:

a) A few years ago, the CEO of Amazon wanted to recommend specific books to customers based on their individual shopping preferences. From its start, the company looked mainly at purchases. Amazon initially processed the data in the conventional way, and the resulting recommendations were crude: if a customer bought thrillers, the recommendation was to buy more thrillers.

However, this does not mean the customer would not be interested in other books. The company realized that the recommendation system did not need to compare people with other people, and decided instead to find associations or correlations among products: other books (not only thrillers), but also other products such as movies.
This generated far better and more appropriate recommendations for customers, increasing their satisfaction.

b) In 2004, the New York Times ran the following story:

Hurricane “Frances” was on its way, threatening a direct hit on Florida’s Atlantic coast. The CIO of Wal-Mart pressed her staff to come up with forecasts based on what had happened when hurricane “Charley” had struck several weeks earlier. It was evident that customers would buy more bottled water, for example.
Backed by the trillions of bytes stored in its data warehouse, the CIO felt that the company could start predicting what was going to happen, instead of waiting for it to happen.

Data mining helped determine precisely for which products demand would rise sharply, and Wal-Mart successfully anticipated this unusual demand.

A large number of data mining algorithms have been developed over the years. It is time now to delve into one of them, called “classification and class-probability estimation”. This algorithm attempts to predict, for each individual in a population, which of a set of classes this individual belongs to.

If we consider “predictive customer churn” again, the classification algorithm determines which class (churn or no churn) a specific customer (called “an individual”) belongs to, and with what probability.
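As a sketch of what class-probability estimation means in practice, the toy example below estimates the churn probability for each value of a single attribute directly from historical frequencies. All data, attribute values and names here are hypothetical, invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical churn history: (age group, did the customer churn?)
history = [
    ("<=50", True), ("<=50", True), ("<=50", False), ("<=50", False),
    (">50", False), (">50", False), (">50", False), (">50", True),
]

# Count churners and totals per attribute value.
counts = defaultdict(lambda: [0, 0])   # age group -> [churned, total]
for age_group, churned in history:
    counts[age_group][1] += 1
    if churned:
        counts[age_group][0] += 1

def churn_probability(age_group):
    """Estimated P(churn) for one attribute value, from historical frequencies."""
    churned, total = counts[age_group]
    return churned / total

print(churn_probability("<=50"))  # 0.5
print(churn_probability(">50"))   # 0.25
```

This is the simplest possible class-probability estimator; real models (decision trees, logistic regression, …) generalize the same idea to many attributes at once.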


While the word “model” can be defined as a “simplified representation of reality”, a predictive model is a formula that aims to estimate an unknown value, called the target.

In other words, predictive models focus on estimating the value of some target variable. The prediction is made by looking for correlations between (1) certain variables (called attributes) and (2) the target variable we want to predict.

The objective is to identify which attributes are really relevant for the prediction.

Multiple attributes can be correlated with the target variable: age, address, sex, income, tenure with the company, overage charges, data usage, number of calls to support, …

Let’s come back to the customer churn example: the objective is to estimate the likelihood that a customer will depart.

The target variable is thus “customer departs or not”: only two values are possible, “depart” or “not”.

Based on the historical set of data, we can calculate the percentage of clients that have left the company. For example, let’s assume that 10% of clients have left.

Today, a company generally calculates a cost estimate based on this overall figure, as explained above.

However, with the right attribute(s) correlated to the target, management could obtain more precise information and thereby reduce the cost of churn.

By using attributes, we classify the data and thus define subsets of data. For each of these subsets we also calculate the percentage of churned customers.

Let’s take the age of the customer as an attribute: > 50 or <= 50.

The question is to determine (1) whether the attribute “age” brings an information gain and (2) how much information this attribute gives about the value of the target.

The possible gain in information is evaluated with a formula called entropy. For a subset in which a proportion p of the customers churn, the entropy (measured in bits) is:

entropy(p) = − p · log₂(p) − (1 − p) · log₂(1 − p)

The closer the churn rate of a subset gets to 100% or to 0%, the lower its entropy: such a subset is nearly pure (almost all churners, or almost none), so the attribute that defines it carries important information about the target.
If the churn rate of a subset is around 50%, its entropy is maximal, which means that the attribute does not give concrete information about the target.

This calculation has to be done for several attributes in order to detect the most informative ones.
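A minimal implementation of this entropy calculation might look as follows, where p is the churn rate of a subset as discussed above:

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a binary target whose positive rate is p."""
    if p in (0.0, 1.0):
        return 0.0          # a pure subset carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5))   # 1.0  -> 50/50 split: the attribute tells us nothing
print(entropy(0.1))   # ~0.47 -> fairly pure subset: quite informative
print(entropy(0.0))   # 0.0  -> fully pure subset: maximally informative
```

Comparing the entropy of the subsets an attribute produces is what lets us rank attributes by how informative they are.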


After having determined the most informative attributes, we can obviously combine several of them and determine whether this combination brings more information about the target: this is still done with the same entropy formula.

We can visualize combinations of attributes using trees, like this one:

In this example we have combined two attributes: Revenue and age.

This gives four subsets of data for which we can calculate the entropy, and compare it with the entropy of each attribute calculated separately, in order to determine whether, and how much, information gain these combinations have created.

At the end of this process, we might learn that customers older than 50 whose revenue is above 5K are potentially faithful (e.g. with a probability of 95%), while customers younger than 50 with revenue below 5K are less faithful (e.g. faithful with a probability of only 80%).
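The whole procedure can be sketched in a few lines of Python. The customer records below are entirely made up; the point is only to show how the information gain of a split is computed (entropy of the whole set minus the weighted entropy of the subsets), and that a combined age/revenue split can yield a higher gain than a split on age alone:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a binary target with positive rate p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hypothetical customer records: (older_than_50, revenue_above_5k, churned)
customers = [
    (True,  True,  False), (True,  True,  False), (True,  True,  False),
    (True,  False, False), (True,  False, True),
    (False, True,  False), (False, True,  True),
    (False, False, True),  (False, False, True),  (False, False, False),
]

def churn_rate(records):
    return sum(1 for *_, churned in records if churned) / len(records)

def information_gain(records, split):
    """Entropy of the whole set minus the weighted entropy of its subsets."""
    parent = entropy(churn_rate(records))
    groups = {}
    for r in records:
        groups.setdefault(split(r), []).append(r)
    weighted = sum(len(g) / len(records) * entropy(churn_rate(g))
                   for g in groups.values())
    return parent - weighted

# Gain from splitting on age alone vs. on the age/revenue combination:
gain_age = information_gain(customers, lambda r: r[0])
gain_combo = information_gain(customers, lambda r: (r[0], r[1]))
print(round(gain_age, 3))    # 0.125
print(round(gain_combo, 3))  # 0.295
```

On this toy data the combined split gains more information than age alone, which is exactly the comparison the tree above is making.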

The company can thus adapt its marketing approach based on this information.


For banks and insurance companies, it is very important to focus on customers’ needs, as today’s customers have high expectations of the way they interact with their bank or insurer. Financial players must be able to understand customer preferences and motivations in detail.

Big Data technology can improve the predictive power of risk models, dramatically improve system response times and effectiveness, provide more extensive risk coverage, and generate significant cost savings through more automated processes, more precise predictive systems, and less risk of failure. Risk teams can gain more accurate risk intelligence from a variety of sources in near real time.

There are many areas where Big Data and more specifically data mining can apply and bring value. For example:

Risk management / fraud management: allows banks to make sure that no unauthorized transactions are made, providing a level of safety and security that raises the security standard of the entire industry.

Client segmentation: Big Data will give banks deep insights into customer spending habits and patterns, simplifying the task of ascertaining their needs and wants. By being able to track and trace each and every customer transaction, banks will be able to categorize their clients based on various parameters, including commonly accessed services, preferred credit card expenditures, or even net worth. The benefit of customer segmentation is that it allows banks to better target their clients with relatable marketing campaigns that are tailored to cater to their requirements.

Cost savings: Big Data will expand the banking industry in a way that allows it to earn more revenue through cost reduction. And by cutting down on unnecessary costs, the banking industry can provide customers with exactly what they’re looking for, instead of irrelevant information.

Customers switching from one company to another (churn): Big Data can help prevent churn by offering a special retention deal prior to the expiration of the contract.

As the volume of banking customers increases, it is almost bound to affect the level of service offered. But it is important for banks to stay on top of everything, as they are responsible for the security of their clients’ funds as well as their personal data. Small-scale databases simply cannot keep up with the increasing volume of information. So, if the banking sector fails to successfully implement Big Data, its databases are almost certain to fail. Switching to Big Data will allow banks to process this information faster, avoiding any potentially embarrassing situations.

The cloud presents a huge opportunity for the evolution of the banking sector, which has remained largely unchanged over the years. And although there are concerns about data security, Big Data can offer a number of advantages for both banks and their customers.

By Pierre Vanden Weghe, consultant at Initio


