
Vincent Granville's Posts (8)

Guest blog post by Rick Riddle.

The World Wide Web is turning into a colossal heap of data that is being stored at hundreds and thousands of datacenters across the world. According to recent research by Data Science Central, the amount of data on the internet is expected to double every two years. That much data is not only hard to store; it also challenges organizations and businesses that want to use it to make useful decisions.

Ever since the idea of big data started making headlines in the cyber fraternity, organizations have been trying to understand and make good use of this phenomenon. This is why businesses are pouring millions of dollars into research, and some have already come up with tools that make it easier for organizations to base decisions on the results from data sets.

What exactly is big data?

Wikipedia defines big data as the term for datasets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.

There is no hard and fast rule for calling any information big data; however, any kind of information that needs new tools and techniques to be processed could be big data.

Processing such data calls for groundbreaking technologies rather than the old hardware that the industry has been using for decades.

Is big data even worth investing in?

There is currently an ongoing debate among stakeholders about how useful investing in this industry could be. The good sign is that a number of organizations have already been able to turn the idea of big data into lucrative businesses. Although there is a lot more to be done in terms of both research and practical development, this new arena in the digital landscape cannot be ignored.

Universities and independent think tanks have termed big data as the next big thing. It has been the buzzword among technical groups for quite some time now. There is steady development being made in data storage, computation, and visualization which gives a lot of hope for the future of this industry.

Another reason why the industry is attracting huge investments from corporations is that the field is quite new. By investing time, money, and manpower in the right place at the right time, companies can take the lead and set an example for others to follow.

How can big data be analyzed?

One of the most important questions about big data is how it can be analyzed, given a size and complexity that ordinary analytical software cannot manage. A common way to analyze big data is a method called MapReduce, which processes data sets in parallel. The process has two parts: the Map function and the Reduce function.

The Map function does all the filtering and sorting of data. It then categorizes each of the processed datasets in a structured form so that it can later be analyzed easily.

The next part of the process is the Reduce function. The Reduce function creates a summary of all the data that was categorized in the previous step.
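
The two steps above can be sketched with the classic word-count example. This is a minimal single-machine illustration, not how any particular framework implements it: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this sketch, and a real system would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted values by key, as a framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Each call to map_phase only needs its own document, and each call to reduce_phase only needs the values for its own key, which is what makes the work easy to spread across machines.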

How can one kick-start the big data journey?

Although little research work is available to the public, a handful of organizations have come up with useful ways to harness big data in making real-life decisions. Some tools are still in development and are available only to researchers.

Here are some important tools and resources that a big data enthusiast can use to bootstrap the process of learning big data:

  1. Hadoop: When we talk about data, storage is a primary concern. This is where you can use Hadoop to organize racks of data. The system is built to simplify data storage, so you will not need to fall back on old database systems that are either too complex for such a task or too expensive to implement at such a scale.
  2. BI Suite: Groundbreaking storage technology needs a whole new level of reporting software. Jaspersoft’s BI Suite comes packed with utilities that can read, understand, and produce reports from database tables that are otherwise too complicated to process. BI Suite can read data from databases such as MongoDB, Neo4j, and CouchDB.
  3. Splunk: Indexing is another big challenge in the big data arena. Luckily, tools like Splunk make it easier to keep track of data through indexing. It is rather like creating sets of metadata for a huge amount of existing data, which then makes it easier for visualization software to return results for the query you have made.

The tools listed above handle a limited set of tasks and are meant for medium-scale operations. There are a number of advanced tools, such as Tableau Desktop and Server and Talend Open Studio, that have very specific purposes and can be used for advanced visualization, queries, and data processing tasks.

Importance of setting goals for big data campaigns

There is no doubt that big data has a lot to offer with the promise to get results. At the same time, it is very important to set goals. When you start digging for useful information in millions of terabytes of data, you risk over-spending your resources. To make sure you are heading in the right direction throughout the whole process, it is equally important to carefully craft an execution plan that minimizes the risk of failure.

The basic step towards achieving successful results from big data campaigns is to choose the right tools. If you choose the wrong tool for any operation, you could put yourself in a state of confusion. This is why you need to bring on board people who can plan and optimize the various steps involved, so that you save not only a lot of time but money as well.

A good example of going forward in the right direction is to have a resource manager who understands the complexities of data science. A good resource manager will not only make the right selection of tools but will also bootstrap the process to save you time and money.

The next person who is going to help you in your ambitions is a big data scientist. This person has insight into the industry and can point out many things that could otherwise go unnoticed. Data scientists also have the capability to predict the results of the steps being taken. This way you can further minimize the risks with the help of the right people in the right place.


The process of making real-life decisions based on big data includes not only selecting the right hardware and a compatible platform but also putting the right people to work. Organizations can indeed make decisions that could have a huge impact on creating intelligent solutions. Just a couple of years ago, processing such an amount of data was only a dream. Owing to the amazing advancements being made in the field of computing and data sciences, these dreams are coming true.

On top of that, many tools are inexpensive and many resources are easy to access. No matter how big or small your company is, it can benefit from big data and pave a path towards a future of unlimited success.


Guest blog post by Diana Beyer.

Database marketing is the process of collecting, analyzing, and identifying important information about your customers.

The information comes from different channels, such as sales records, email, and warranty cards.

With this information you can sharpen your marketing campaigns to focus on what’s important: the client.

Database marketing techniques help you with:

  • Improving all your marketing efforts
  • Increasing sales
  • Retaining and getting more customers

If you’re still not convinced, imagine having a complete profile about your customers. You’d know their age, gender, buying habits, the things they need, and any other possible information.

Any business with such complete information can use it to increase sales and ultimately profits.

This is like stumbling on a gold mine that only you know about.

Customers want to feel special, and the only way to give them that is by learning as much as you can about what they want and need.

So below you’ll learn techniques that are essential for you to successfully use your database and boost your sales.

Customer Lifetime Value

The CLV is a model widely used as a database marketing technique that can estimate what the lifetime relationship between the customer and the business will be worth to the latter. This technique has gotten more accurate throughout the years and is used by any respectable marketing expert.

With this information, businesses can make a precise estimate of their revenue potential, and therefore of the company’s potential. This data also helps you identify the areas that you can improve to increase this number.
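
To make the idea concrete, here is a minimal sketch of one simplified way to estimate CLV. The formula (average order value times orders per year times years retained times profit margin) is an assumption chosen for illustration; the article does not prescribe a specific model, and real CLV models are usually more elaborate. The function name and the numbers are hypothetical.

```python
def customer_lifetime_value(avg_order_value, orders_per_year, years_retained, profit_margin=1.0):
    """Simplified CLV: expected profit from one customer over the whole relationship.

    Assumes constant order value, order frequency, and margin (an illustrative simplification).
    """
    return avg_order_value * orders_per_year * years_retained * profit_margin

# Hypothetical customer: 50.00 per order, 4 orders a year, retained for 5 years, 25% margin.
clv = customer_lifetime_value(50.0, 4, 5, profit_margin=0.25)
print(clv)
```

Raising any one of the inputs (order value, frequency, retention, or margin) raises the estimate, which is a simple way to see which area to improve.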

Customer Communications

The amount of data a business produces can give enough information about who buys its products, and that opens the possibility for improvement in many different areas.

Customer communication is one of them. Clients love to feel special and that their purchase is important. So the best way to do that is by using the data on your customer to improve the channels that you use to communicate with them.

A few reasons for you to use your database in customer communication:

  • Clients that feel important buy more
  • It increases loyalty to your brand
  • It increases sales

RFM (Recency, Frequency, Monetary Analysis)

No business wants to waste its time promoting something to a one-time customer or to someone who subscribed to its website and never bought anything. This is when this technique comes in handy.

RFM has been around since before the internet, but many companies don’t use it right. RFM is a technique that identifies the customers who bought something recently (recency), who buy often (frequency), and who spend the most on products (monetary).

With this kind of data, you can target your marketing efforts on specific customers that will bring you the desired conversion.

RFM helps you with:

  • Not wasting time and money on the wrong prospects
  • Creating a focused marketing campaign
  • Achieving great conversion rates by using the 80/20 rule
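
As a rough illustration of the three RFM measures, the sketch below computes them per customer from a list of purchases. The data layout, the function name rfm_table, and the sample customers are assumptions made for this example, not part of any specific tool.

```python
from datetime import date

def rfm_table(purchases, today):
    """Compute recency, frequency, and monetary value per customer.

    purchases: iterable of (customer_id, purchase_date, amount) tuples.
    """
    table = {}
    for customer, day, amount in purchases:
        row = table.setdefault(customer, {"last": day, "frequency": 0, "monetary": 0.0})
        row["last"] = max(row["last"], day)   # most recent purchase date
        row["frequency"] += 1                 # number of purchases
        row["monetary"] += amount             # total amount spent
    return {
        customer: {
            "recency_days": (today - row["last"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
        }
        for customer, row in table.items()
    }

# Hypothetical purchase history
purchases = [
    ("alice", date(2016, 1, 5), 40.0),
    ("alice", date(2016, 3, 1), 60.0),
    ("bob", date(2015, 6, 10), 20.0),
]
rfm = rfm_table(purchases, today=date(2016, 3, 31))
print(rfm)
```

Customers with a low recency_days value and high frequency and monetary values are the ones a focused campaign would target first.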

Analytical Software

Analytical software is the next step in database marketing. With this technique, you have software interpreting your data.

In today’s world, you have a website, a mobile app, social media, and other channels that gather a lot of data on your customers, and having software that helps organize all of that for you can be a lifesaver.

So the software is really useful if you don’t have an expert to analyze all the data generated by your customers to learn and predict their behaviors.

Loyalty Programs

As you will notice throughout this article, the database allows you to be focused. This means getting as specific as you can about who your clients are.

All the purchases that your customers make generate a sales history. With this information, you can create loyalty programs.

You basically determine what your customers want the most from your business and offer it to them in a loyalty program. This is the best way to secure sales and beat the competition before there’s even any.


With the advent of social media and many other tools, email might have become obsolete for a few things, but what email still does best is sales.

Any blog or website has a field where you can subscribe using your email. With email, you can better communicate with your customer and learn their needs.

Email is a great sales tool, since you can also see which emails your clients are opening and which ones they’re not, and use that to improve your approach.

Customer Segmentation

Gone are the days when salespeople knew their customers’ names by memory. Today you sell to thousands, maybe even millions, so you have to keep a database to store all the information.

Customer segmentation is when you divide this information by demographics, so you can create segments, such as young males, and tailor marketing efforts to each one.

This is really helpful for companies that have a diverse customer base. You can attend to everybody’s needs, keeping all your clients happy.

Multi-channel marketing

This usually refers to customers who buy through retail, catalog, and online channels. The ideal database tackles all the angles to give the client the best experience no matter where they’re buying.

Clients who buy through multiple channels buy more than single-channel buyers. But you also have to remember that nowadays people are buying through mobile apps and social media, not just websites.

This is all great news: the more you improve the customer experience across all your channels, the more loyal clients you’ll have.

Profitability Analysis

This is similar to the RFM technique, but they’re not the same. With profitability analysis, you can measure the profitability of each customer.

Before this, it was hard to quantify your customers’ profitability. Now you can use that to refine your marketing efforts and your pricing strategies.

Profitability analysis helps you with:

  • Discovering unprofitable customers
  • Learning what your profitable customers want and how much they’re willing to pay for it
  • Increasing your profits
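
A minimal sketch of per-customer profitability follows. It assumes profit is revenue minus the cost of the goods sold minus the cost of serving the customer; those three fields, the function name, and the figures are hypothetical and only meant to show the calculation.

```python
def customer_profit(revenue, cost_of_goods, cost_to_serve):
    """Profit attributed to one customer (illustrative definition)."""
    return revenue - cost_of_goods - cost_to_serve

# Hypothetical yearly figures per customer
customers = {
    "alice": {"revenue": 1200.0, "cost_of_goods": 700.0, "cost_to_serve": 150.0},
    "bob": {"revenue": 300.0, "cost_of_goods": 220.0, "cost_to_serve": 130.0},
}

profits = {name: customer_profit(**figures) for name, figures in customers.items()}
unprofitable = sorted(name for name, profit in profits.items() if profit < 0)
print(profits, unprofitable)
```

Unlike RFM, which ranks customers by how they buy, this ranks them by what is left after costs, so a frequent buyer can still show up as unprofitable.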

Treating customers differently

At first, this may sound horrible, but every big company does it. Think, for instance, of banks and airlines.

Remember the 80/20 rule: 80% of your revenue comes from 20% of your customers. So it only makes sense to concentrate on them the marketing money that you couldn’t afford to spend on all your customers.

This way you retain them by creating satisfaction, and you also give the other customers an incentive to move up the ladder.

So you can:

  • Create Gold and Platinum programs
  • Give special discounts
  • Offer special services
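
Before building tiers around the 80/20 rule, it is worth checking how concentrated your own revenue really is. The sketch below is one simple way to do that; the function name top_share and the revenue figures are made up for illustration.

```python
def top_share(revenues, top_percent=20):
    """Share of total revenue contributed by the top `top_percent` percent of customers."""
    ranked = sorted(revenues, reverse=True)
    top_n = max(1, len(ranked) * top_percent // 100)  # at least one customer
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical revenue per customer
revenues = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
share = top_share(revenues)
print(share)
```

If the share is far below 0.8 for your data, revenue is spread more evenly and a steep Gold and Platinum split may not pay off.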

Penetration Analysis

This technique can be really powerful if used right. Using the database, you can run an analysis to determine how many sales you’re making in each zip code, age group, or income bracket.

You can then use this information to fix a possible problem and increase your sales in that segment.
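
One way to sketch this analysis is to divide the number of customers in each segment by the number of potential customers there. The zip codes, the population figures, and the function name below are hypothetical; in practice the denominator would come from census or market data.

```python
def penetration_by_segment(customers, population):
    """Penetration rate per segment: customers divided by potential customers.

    customers and population map a segment label (here a zip code) to a count.
    """
    return {segment: customers.get(segment, 0) / population[segment] for segment in population}

# Hypothetical counts per zip code
rates = penetration_by_segment(
    customers={"10001": 150, "10002": 30},
    population={"10001": 1000, "10002": 1500, "10003": 800},
)
weakest = min(rates, key=rates.get)
print(rates, weakest)
```

The segment with the lowest rate is the first candidate for a closer look at what is holding sales back.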

Predictive Models

This database marketing technique can be really powerful when combined with customer communication.

It consists of using the data that you have on your customers to predict which ones will respond to a marketing campaign and which ones will ignore it.

This information helps you refine your campaign to increase the number of conversions and reduce customer attrition.

So predictive models help you with:

  • Improving marketing campaigns
  • Reducing customer attrition
  • Increasing sales

Appended Data

Appended data is when you link varied information to a name or address. In marketing, you’d use this to organize the information on your customer more efficiently.

For example, you’d put the name of the client with their address and other information like gender, age, ethnicity, income, and net worth.

With this information, you can create predictive models. You can learn more about your customer, and what they are more likely to buy.

This helps you offer a better service that has everything your clients need.


In this day and age, every business needs a website, even if you don’t sell online. A personalized website with cookies can work wonders for your business.

A smart, interactive website can bond with the client and even sell. The website is also important because clients like to do their research before buying anything. And if you don’t have a web presence they may think you’re not legit.


Remember to use these techniques so you can:

  • Improve your services
  • Increase the value of your customer
  • Retain and get more clients
  • Increase sales
  • And increase profit

So if you use these powerful database marketing techniques, you’ll be on your way to optimizing your whole marketing campaign.


How to Become a Data Scientist for Free

Guest blog post by Zeeshan Usmani.

Big Data, Data Sciences, and Predictive Analytics are the talk of the town, and it doesn’t matter which town you are referring to. They are everywhere, from the White House hiring DJ Patil as its first chief data scientist to the United Nations using predictive analytics to forecast bombings on schools. Dozens of startups spring up every month, stretching the human imagination of how the underlying technologies can be used to improve our lives and everything we do. Data science is in demand, and its growth is on steroids. According to LinkedIn, “Statistical Analysis” and “Data Mining” are the two top skills to get hired this year. Gartner says there are 4.4 million jobs for data scientists (and related titles) worldwide in 2015, 1.9 million of them in the US alone. One data science job creates another three non-IT jobs, so we are talking about some 13 million jobs altogether. The question is what YOU can do to secure a job and make your dreams come true, and how YOU can become someone who qualifies for these 4.4 million jobs worldwide.

There are at least 50 data science degree programs at universities worldwide offering diplomas in this discipline. They cost from 50,000 to 270,000 US$ and take 1 to 4 years of your life. Such a program might be a good option if you are looking to join college soon, and it has its own benefits over programs in similar or not-so-similar disciplines. I find these programs very expensive for people from developing countries, and it is hard for working professionals to commit X years of their lives.

Then there are a few very good summer programs, fellowships, and boot camps that promise to make you a data scientist in a very short span of time. Some of them are free but almost impossible to get into, others require a PhD or advanced degree, and some cost between 15,000 and 25,000 US$ for 2 months or so. While these are very good options for recent Ph.D. graduates to gain some real industry experience, we have yet to see their quality and performance against a veteran industry analyst. A few of the ones that I really like are Data Incubator, Insight Fellowship, Metis Bootcamp, Data Science for Social Good, and the famous Zipfian Academy programs.

Let me also mention a few paid resources that I am a fan of before I tell you how to do all that for free. The first is the Explore Data Science program by Booz Allen; it costs 1,250 $ but is worth every single penny. The second is the recorded lectures by Tim Chartier on DVD, called Big Data: How Data Analytics Is Transforming the World; it costs 80 bucks and is worth your investment. Next on the list are two courses by MIT: Tackling the Big Data Challenges, which costs 500$ and provides a very solid theoretical foundation on big data, and The Analytics Edge, which costs only 100 bucks and gives a superb introduction to how analytics can be used to solve day-to-day business problems. If you can spare a few hours a day, Udacity offers a perfect Nanodegree for Data Analysts that costs 200$/month and can be completed in 6 months or so; they offer this in partnership with Facebook, Zipfian Academy, and MongoDB. Thinkful has a wonderful program for 500$/month that connects you live with a mentor to guide you to become a data scientist.

OK, so what can one do to become a data scientist if he/she cannot afford, or get selected for, the aforementioned competitive and expensive programs? What can someone from a developing country do to improve his/her chances of getting hired in this very important field, or even to use these advanced skills to improve their own surroundings, communities, and countries?

Here is my cheat sheet for becoming a Data Scientist for Free:


  1. Understand Data: Data is useless and can (and should) be misleading without context. Data needs a story to tell a story. Data is like a color that needs a surface to even prove its existence. The color red, for example, can’t prove its existence without a surface: we see a red car, a red scarf, a red tie, red shoes, or red something. Similarly, data needs to be associated with its surroundings, context, methods, and the whole life cycle in which it is born, generated, used, modified, executed, and terminated. I have yet to find a “data scientist” who can talk to me about the “data” without mentioning technologies like Hadoop, NoSQL, Tableau, or other sophisticated vendors and buzzwords. You need to have an intimate relationship with your data; you need to know it inside out. Asking someone else about anomalies in “your” data is equal to asking your wife how she gets pregnant. One of the distinct edges we had in our relationship with the UN and the software to secure schools from bombings is our command over the underlying data. While the world talks about it using statistical charts and figures, we are the ones back home who experience it and live it in our daily lives; the importance, details, and appreciation of this data that we have cannot be found anywhere else. We are doing the same with our other projects and clients.
  2. Understand Data Scientist: Unfortunately, one of the most confused and misused terms in the data sciences field is “data scientist” itself. Some relate it to a mystic oracle who knows everything under the sun, while others reduce it to a statistical expert. For a few it’s someone familiar with Hadoop and NoSQL, and for others it is someone who can perform A/B testing and use so many mathematical and statistical terms that he is hard to understand in executive meetings. For some, it is visualization dashboards, and for others it’s never-ending ETL processes. For me, a Data Scientist is someone who understands less about the science than the ones who create it and a little less about the data than the ones who generate it, but knows exactly how these two work together. A good data scientist is one who knows what is available “outside the box” and who he needs to connect with or hire, or which technologies he needs to deploy, to get the job done; one who can link business objectives with data marts; and one who can simply connect the dots from business gains to human behaviors and from data generation to dollars spent.
  3. Watch these 13 Ted Videos
  4. Watch this video of Hans Rosling to understand the power of Visuali...
  5. Listen to weekly podcasts by Partially Derivative on Data Sciences and explore their Resources page
  6. University of Washington’s Intro to Data Science  and Computing for data analysis will be a good start
  7. Explore this GitHub Link and try to read as much as you can
  8. Check out Measure for America to gain an understanding of how data can make a difference
  9. Read the free book - Field Guide to Data Sciences
  10. Religiously follow this infographic on how to become a data scientist
  11. Read this blog to master your R skills
  12. Read this blog to master your statistics skills
  13. Read this wonderful practical intro to data sciences by Zipfian Academy
  14. Try to complete this open source data science Masters program
  15. Do this Machine Learning course at Coursera by Coursera co-founder Andrew Ng himself
  16. By all means, complete this Data Science Specialization on Coursera, all nine courses, and the capstone
  17. If you lack a computer science background or want to go towards the programming side of the data sciences, try to complete this Data Mining Specialization from Coursera
  18. Optional: depending on the industry you would like to work in, you may want to check out these industry-specific courses and links on data sciences: healthcare analytics (intro and specialization), education, performance optimization, and general academic research
  19. To understand the deployment side of data science applications, this cloud computing specialization from Coursera and YouTube Amazon Web Services and free trainings are a must
  20. Do these second-to-none courses on Mining Massive Datasets  and Process Mining
  21. This link will lead you to 27 best data mining books for free
  22. Try to read Data Science Central once a day, articles like this can save you a lot of time and discussion in interviews
  23. Try to compete in as many Kaggle competitions as you can
  24. To put a cherry on the cake, these statistics-driven courses will help you stand out from all other applicants: Inferential Statistics, Descriptive Statistics, Data Analysis and Statistics, Passion Driven Statistics, and Making Sense of Data
  25. Follow the following on Twitter for Predictive Analytics: @mgualtieri, @analyticbridge, @doug_laney, @Hypatia_LeslieA, @hyounpark, @KDnuggets, and @anilbatra
  26. Follow the following on Twitter for Big Data and Data Sciences: Alistair Croll, Alex Popescu, @rethinkdb, Amy Heineike, Anthony Goldbloom, Ben Lorica, @oreillymedia, Bill Hewitt, Carla Gentry CSPO, David Smith, David Feinleib, Derrick Harris, DJ Patil, Doug Laney, Edd Dumbill, Eric Kavanagh, Fern Halper, Gil Press, Gregory Piatetsky, Hilary Mason, Jake Porway, James Gingerich, James Kobielus, Jeff Hammerbacher, Jeff Kelly, Jim Harris, Justin Lovell, Kevin Weil, Krish Krishnan, Manish Bhatt, Merv Adrian, Michael Driscoll, Monica Rogati, Neil Raden, Paul Philp, Peter Skomoroch, Philip (Flip) Kromer, Philip Russom, Paul Zikopoulos, Russell Jurney, Sid Probstein, Stewart Townsend, Todd Lipcon, Troy Sadkowsky, Vincent Granville, William McKnight, Yves Mulkers


The whole list will take 3 to 12 months to complete and will cost you absolutely nothing, and I can guarantee that with this skill set you will have to try very hard to remain jobless. Even if you complete half of it, send me a note and I will have something ready for you.

The ball is in your court. It doesn’t matter where you are or how much you can afford: if you want to make at least four times the average income of your countrymen, this is the way to do it, at least for the next 10 years (when we will be generating 20 TB of data per year per person versus 1 TB of data per year per person in the last 10 years).


I will write separate articles on Data Science Books (I’ve read 127 of those in the last six months) and MOOCs (I am celebrating my 25th MOOC certification today).

For everyone else, data sciences is an opportunity; for me, it’s a passion.

I tweet at @ZeeshanUsmani


Walmart Kaggle: Trip Type Classification

Contributed by Joe Eckert, Brandon Schlenker, William Aiken, and Daniel Donohue. They took the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. The post was based on their fourth in-class project (due after the 8th week of the program).


Walmart uses trip type classification to segment its shoppers and their store visits to better improve the shopping experience.   Walmart's trip types are created from a combination of existing customer insights and purchase history data.  The purpose of the Kaggle competition is to use only the purchase data provided to derive Walmart's classification labels.  The goal for Walmart is to refine their trip type classification process.


About the Data

  • ~ 96k store visits, segmented into 38 trip types
  • Training and testing data included >1.2 million observations with 6 features:
    • Visit Number, Weekday, UPC, Scan Count, Department Description, Fineline Number
  • Using the 6 provided features the team was tasked with creating the best model to accurately classify the trips into their proper trip type category
  • Challenges with the data
    • Each observation represented an item rather than a visit
    • Needed to group observations by visit to classify the trip
    • Number of unique UPCs and Fineline Numbers prevented the creation of dummy variables - resulting data set was too large to process
    • Instead, used the Department Description to create dummy variables
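
The grouping challenge described above can be sketched in a few lines: collapse the item-level rows into one feature vector per visit, with one column per department. This is a simplified illustration with made-up rows and a hypothetical helper name (visit_features); it counts items per department rather than reproducing the team's exact dummy-variable pipeline.

```python
from collections import Counter, defaultdict

def visit_features(rows):
    """Turn item-level rows into one department-count vector per visit.

    rows: list of (visit_number, department) pairs, one per scanned item.
    Returns the sorted department names and a dict visit_number -> list of counts.
    """
    per_visit = defaultdict(Counter)
    for visit, department in rows:
        per_visit[visit][department] += 1
    departments = sorted({department for _, department in rows})
    features = {
        visit: [counts[d] for d in departments]  # a missing department counts as 0
        for visit, counts in per_visit.items()
    }
    return departments, features

# Hypothetical item-level observations
rows = [
    (7, "DAIRY"), (7, "PRODUCE"), (7, "DAIRY"),
    (8, "FINANCIAL SERVICES"), (8, "DAIRY"),
]
departments, features = visit_features(rows)
print(departments)
print(features)
```

Using departments keeps the number of columns small, whereas one column per UPC or Fineline Number would make the table far wider.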


Model 1: Logistic Regression

Implemented multinomial logistic regression to determine trip type.  Standard logistic regression is used for two-class predictions.  The one-vs-rest extension described here fits a separate logistic regression for each class against all others, repeating the process until every class has been regressed one vs all.

  • Log loss score: 4.22834
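
For readers unfamiliar with the metric, here is a minimal sketch of multiclass log loss: the average negative log of the probability assigned to the true class. The function name and the sample probabilities are made up, and the clipping constant eps is an assumption commonly used to avoid taking the log of zero.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    predicted_probs: one dict per observation mapping class label -> probability.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(true_labels)

# Two hypothetical visits with true trip types 3 and 4
probs = [{3: 0.7, 4: 0.2, 5: 0.1}, {3: 0.1, 4: 0.8, 5: 0.1}]
loss = multiclass_log_loss([3, 4], probs)

# A uniform guess over 38 classes
uniform = multiclass_log_loss([3], [{3: 1 / 38}])
print(loss, uniform)
```

Lower is better. As a reference point, a uniform guess over 38 classes scores ln(38), which is about 3.64.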

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.linear_model import LogisticRegression
import time

start_time = time.time()

waltrain = pd.read_csv('train.csv')
waltest = pd.read_csv('test.csv')
waltrain = waltrain[waltrain.FinelineNumber.notnull()]
waltrain_part = waltrain[:]
waltest_part = waltest[:]

model = LogisticRegression()
x = waltrain_part[['Weekday', 'DepartmentDescription']]
y = waltrain_part[['TripType']]
x = pd.get_dummies(x)
z = waltest_part[['Weekday', 'DepartmentDescription']]

zend = pd.DataFrame({'Weekday': ['Sunday'],
'DepartmentDescription': ['HEALTH AND BEAUTY AIDS']},
index = [len(z)])
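# Append the placeholder row (presumably so the test dummies include
# the same department columns as the training dummies)
z = pd.concat([z, zend])
z = pd.get_dummies(z)

# Fit the classifier on the one-hot encoded training features
model.fit(x, y.values.ravel())
print("The model coefficients are:")
print(model.coef_)
print("The intercepts are:")
print(model.intercept_)

print("model created after %f seconds" % (time.time() - start_time))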

submission = model.predict_proba(z)
submissiondf = pd.DataFrame(submission)


dex = waltest.iloc[:,0]
submurge = pd.concat([dex,submissiondf], axis = 1)
avgmurg = submurge.groupby(submurge.VisitNumber).mean()
avgmurg.reset_index(drop = True, inplace = True)
# predict_proba returns one column per class, ordered as in model.classes_
avgmurg.columns = ['VisitNumber'] + ['TripType_%d' % c for c in model.classes_]

avgmurg[['VisitNumber']] = avgmurg[['VisitNumber']].astype(int)
avgmurg.to_csv('KaggleSub_04.csv', index = False)

print("finished after %f seconds" % (time.time() - start_time))

Model 2: Random Forest

For the second model the team implemented a random forest.  Random forests are a collection of decision trees.  Classification is done by a 'majority vote' of the decision trees within the random forest.  That is, for a given observation the class that is most frequently predicted within the random forest will be the class label for that observation.
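
The majority vote described above fits in a couple of lines. This sketch only shows the voting step for a single observation; the tree predictions are made-up labels, and a real random forest also has to train the trees.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical trip-type predictions from five trees for one store visit
votes = [39, 8, 39, 40, 39]
label = majority_vote(votes)
print(label)
```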

Engineered Features:

  • Total number of items per visit
  • Percentage of items purchased based on Department
  • Percentage of items purchased based on Fineline Number
  • Percentage of items purchased by UPC
  • Count of different items purchased (based on UPC)
  • Count of returned items
  • Boolean for presence of returned item


Below you can see the progression of the performance of the random forest as adjustments were made:


  • Best log loss score: 1.22730


Model 3: Gradient Boosted Decision Trees

Gradient boosted trees are a supervised learning method where a strong learner is built from a collection of decision trees in a stagewise fashion, where subsequent trees focus more on observations that were misclassified by earlier trees.

Engineered Features:

  • Day of the week (expressed as an integer)
  • Number of purchases per visit
  • Number of returns per visit
  • Number of times each department was represented in the visit
  • Number of times each fineline number was represented in the visit

For this model the team used the XGBoost and Hyperopt Python packages.  XGBoost is a package for gradient boosted machines, which is popular in Kaggle competitions for its memory efficiency and parallelizability.  Hyperopt is a package for hyperparameter optimization that takes an objective function and minimizes it over some hyperparameter space.  Unfortunately, we needed to split the training set into two halves (the prepared dataset was too large to keep in memory), train two XGBoost models, and then average their results.  Not training on the whole dataset is probably what resulted in the larger log loss score.

  • Best log loss score: 1.48
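
Averaging the output of two models trained on separate halves, as described above, can be sketched as an element-wise mean of their predicted class probabilities. The probabilities and the function name are made up for illustration; this is not the team's code.

```python
def average_predictions(probs_a, probs_b):
    """Element-wise mean of two models' predicted class probabilities.

    probs_a, probs_b: lists of rows, one row of class probabilities per observation.
    """
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]

# Hypothetical probabilities for one visit over three classes
model_a = [[0.6, 0.3, 0.1]]
model_b = [[0.2, 0.5, 0.3]]
avg = average_predictions(model_a, model_b)
print(avg)
```

Because each input row sums to one, the averaged row does too, so the result is still a valid probability distribution for scoring with log loss.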

The code for this approach to the problem can be found here.



Given the size of the data set the accuracy achieved was limited due to memory constraints.  The best performance was achieved using random forest after implementing grid search for feature selection and parameterization.  Feature engineering was extremely important in this competition given that the rules restricted the use of external data.

Read more…

Guest blog by George Psistakis.

Working with data sets means that you have a way to get them first. After you get them you have to clean them.

Data scientists spend 80% of their time in data cleaning and data manipulation and only 20% of their time actually analyzing it.

And then you find yourself spending 80% of your time cleaning that data. At the same time, deadlines and management demands keep you up at night.

This is one reason data analysts and data scientists regularly scour the web looking for anything that could help. Tools, tutorials, resources.

I have stumbled upon many posts about general data science MOOCs or tutorials, but never one that lists resources for one of the most time-consuming processes in the data pipeline: data cleaning.

In this post, I did my best to gather everything there is online. If you find a resource that I missed please let me know in the comments below.

Let's start with the basics...

What is data cleaning?

Data cleaning, data cleansing or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. 
Source: Wikipedia

Note: Some of the courses below belong to specializations or batches of courses, such as Coursera's Data Science Specialization or Udacity's Nanodegree Program, but you may also take each course individually. If you are interested in a certificate, there is usually a fee. If not (for Coursera at least), you may "audit" the course. Some courses are free and others are subscription-based services.

Data Cleaning in R

Getting and Cleaning Data (Coursera)

  • Course Name: Getting and Cleaning Data
  • Institution: Johns Hopkins University
  • Coursera Specialization: Data Science Specialization
  • Price: Free
  • Belongs to Coursera's Data Science Specialization from Johns Hopkins University and is one of the best data cleaning courses out there. The course covers the basics needed for collecting, cleaning, and sharing data.

Data Science and Machine Learning Essentials (edX)

  • Course Name: Data Science and Machine Learning Essentials
  • Institution: Microsoft
  • Price: Free, paid for certificate
  • Another one of the best data science MOOCs. It covers tools like R, Python and SQL, and topics including data acquisition, ingestion, sampling, quantization, cleaning, and transformation.

Data Science with R (O'Reilly)

  • Course Name: Data Science with R
  • Price: Paid
  • It is part of one of O'Reilly's Learning Paths. It goes from the basics to more advanced techniques, including R Graph and machine learning. It contains an intro to Data Science with R, how to manipulate data sets, and expert Data Wrangling with R.

Cleaning Data in R (DataCamp)

Foundations of Data Science (Springboard)

  • Course Name: Foundations of Data Science
  • Price: Free (some chapters), Subscription based or one-time payment
  • It has a unit about Data Wrangling and data cleaning with R.

Udemy Courses

You may want to take a look at the list of resources about Data cleaning and R inside Udemy. There are a lot to choose from, but it might require some searching to find which one is valuable to you.


There are more courses about Python, SQL and more. You may read the rest of the updated and curated list here. If you know one that I missed, please let me know!

Read more…

The ABCD's of Business Optimization

I was trying to find a good domain name for our upcoming business science website when something suddenly became clear to me. Many of us have been confused for a long time about what data science means and how it differs from statistics, machine learning, data mining, or operations research, and about the rise of the data scientist light - a new species of coders who call themselves data scientists after a few hours of Python/R training, working on a small project at best, and spending $200 on their training. The data scientist light is not a real one, even though I believe that you can learn data science from scratch on the job, just as I did.

This introduction brings me to the ABCD's, and the arguments are further developed in my conclusion below. These four domains are certainly overlapping. But I believe that identifying them brings more clarity about roles differentiation and collaboration.

  • Analytics Science. Deals with modern statistical modeling, predictive modeling, model-free (data-driven) statistics, root cause analysis, defining and selecting metrics, and traditional techniques such as clustering, SVM, linear regression, K-NN (whether you call it machine learning, AI, statistics or data science). Analytics scientists are true geeks with generic knowledge applicable to many domains. They may work on small or big data. Analytics science, unlike data science, is well documented in college textbooks.
  • Business Science. Deals with principles, both theoretical and applied, where domain expertise and deep cross-department business understanding are critical. The purpose is to leverage analytics to deliver added value or increased profits. Business scientists might spend little time coding, unlike the three other categories. Examples of business science applications can be found in this article, and also in this article. It may overlap with BI, Six Sigma and operations research. So it can definitely involve a great deal of statistical modeling, modern or not.
  • Computer Science: Deals with architecture (including real-time, distributed and cross-platforms such as IoT), algorithm design and refinement, platform design, Internet and communications protocols, data standards, systems engineering, software engineering and prototyping.
  • Data Science: Deals with data identification, collection, cleaning, summarizing, and insights extraction - even dashboards and visualizations that help with the executive decision process. Also includes advanced algorithms for big data, sensor data (IoT), black-box analytics, batch-mode analytics, automation of analytics processes, API's, and analytics-driven systems based on machine-to-machine communications (for instance, automated bidding or fraud monitoring). Typically, simple black-box, machine-controlled techniques are more difficult to design than complex man-controlled analyses, because they must be made very robust, as opposed to very accurate. Many data scientists also know quite a bit of analytics science, especially modern principles (typically not published in college textbooks) to deal with big, unstructured, fast-flowing data. While in some ways computer scientists make data come alive, data scientists take it from there and make it intelligent.

I finally decided to call myself a business scientist, as my experience is more and more aligned with this domain (being an entrepreneur), though, like many of us here, I have significant knowledge and expertise in all four domains, especially in data science and analytics science. My motivation to call myself a business scientist is also partly to avoid being confused with a data scientist light. This erroneous label is sometimes applied to us (real) data scientists by a minority of vocal analytics scientists. I believe that we need to dispel this myth. Part of the reason, I believe, is that math-free solutions that also trade accuracy for robustness (in order to fit in black-box systems or be usable by the layman) are not respected by some traditional statisticians, who erroneously believe that automation and/or removing statistical jargon and mathematical background is not possible. Maybe because it could jeopardize their jobs?

In the end, I want to make data science accessible to everyone, not to an elite of initiated, change-averse professionals. It requires a new, unified, simple, efficient, math-free or math-light (but not data science light) approach to analytics problems and solutions, as well as algorithmic ingenuity. This is feasible, but more difficult than producing extremely complicated statistical models - which is what I was doing earlier in my career.

About the author: Vincent Granville worked for Visa, eBay, Microsoft, Wells Fargo, NBC, a few startups and various organizations, to optimize business problems, boost ROI or to develop ROI attribution models, developing new techniques and systems to leverage modern big data and deliver added value. Vincent owns several patents, published in top scientific journals, raised VC funding, and founded a few startups. Vincent also manages his own self-funded research lab, focusing on simplifying, unifying, modernizing, automating, scaling, and dramatically optimizing statistical techniques. Vincent's focus is on producing robust, automatable tools, API's and algorithms that can be used and understood by the layman, and at the same time adapted to modern big, fast-flowing, unstructured data. Vincent is a post-graduate from Cambridge University.

Follow us on Twitter at @ROIdoctor | Visit our e-Store

Read more…

Data Science Book - Table of Contents

This book is also part of our apprenticeship. Part of the content as well as new content is in a separate document called Addendum. Click here to download the addendum. The book is available on Barnes and Noble. Also, read our article on strong correlations to see how various sections of our book apply to modern data science. If you start from zero, read my data science cheat sheet first: it will greatly facilitate the reading of my book.

My second book - Data Science 2.0 - can be checked out here. The book described on this page is my first book.

About the Author

Dr. Vincent Granville is a well-rounded, visionary data science executive with a broad spectrum of domain expertise, technical knowledge, and proven success in bringing measurable added value to companies ranging from startups to Fortune 100, across multiple industries (finance, Internet, media, IT, security), domains (data science, operations research, machine learning, computer science, business intelligence, statistics, applied mathematics, growth hacking, IoT) and roles (data scientist, founder, CFO, CEO, HR, product development, marketing, media buyer, operations, management consulting).

Vincent developed and deployed new techniques such as hidden decision trees (for scoring and fraud detection), automated tagging, indexing and clustering of large document repositories, black-box, scalable, simple, noise-resistant regression known as the Jackknife Regression (fit for black-box, real-time or automated data processing), model-free confidence intervals, bucketisation, combinatorial feature selection algorithms, detecting causation not correlations, automated exploratory data analysis with data dictionaries, data videos as a visualization tool, automated data science, and generally speaking, the invention of a set of consistent robust statistical / machine learning techniques that can be understood, implemented, interpreted, leveraged and fine-tuned by the non-expert. Vincent also invented many synthetic metrics (for instance, predictive power and L1 goodness-of-fit) that work better than old-fashioned stats, especially on badly-behaved sparse big data. Some of these techniques have been implemented in a Map-Reduce Hadoop-like environment. Some are concerned with identifying true signal in an ocean of noisy data.

Vincent is a former post-doctorate of Cambridge University and the National Institute of Statistical Sciences. He was among the finalists at the Wharton School Business Plan Competition and at the Belgian Mathematical Olympiads. Vincent has published 40 papers in statistical journals (including Journal of Number Theory, IEEE Pattern Analysis and Machine Intelligence, Journal of the Royal Statistical Society, Series B), a Wiley book on data science, and is an invited speaker at international conferences. He also holds a few patents on scoring technology, and raised $6 MM in VC funding for his first startup. Vincent also created the first IoT platform to automate growth and content generation for digital publishers, using a system of API's for machine-to-machine communications, involving Hootsuite, Twitter, and Google Analytics.

Vincent's profile is accessible here and includes top publications, presentations, and work experience with Visa, Microsoft, eBay, NBC, Wells Fargo, and other organisations.


To find out whether this book might be useful to you, read my introduction.

Table of Contents

  • Introduction xxi
  • Chapter 1 What Is Data Science? 1
  • Chapter 2 Big Data Is Different 41
  • Chapter 3 Becoming a Data Scientist 73
  • Chapter 4 Data Science Craftsmanship, Part I 109
  • Chapter 5 Data Science Craftsmanship, Part II 151
  • Chapter 6 Data Science Application Case Studies 195
  • Chapter 7 Launching Your New Data Science Career 255
  • Chapter 8 Data Science Resources 287
  • Index 299

Chapter 1 - What Is Data Science? 1

Real Versus Fake Data Science 2

  • Two Examples of Fake Data Science 5
  • The Face of the New University 6

The Data Scientist 9

  • Data Scientist Versus Data Engineer 9
  • Data Scientist Versus Statistician 11
  • Data Scientist Versus Business Analyst 12

Data Science Applications in 13 Real-World Scenarios 13

  • Scenario 1: DUI Arrests Decrease After End of State Monopoly on Liquor Sales 14
  • Scenario 2: Data Science and Intuition 15
  • Scenario 3: Data Glitch Turns Data Into Gibberish 18
  • Scenario 4: Regression in Unusual Spaces 19
  • Scenario 5: Analytics Versus Seduction to Boost Sales 20
  • Scenario 6: About Hidden Data 22
  • Scenario 7: High Crime Rates Caused by Gasoline Lead. Really? 23
  • Scenario 8: Boeing Dreamliner Problems 23
  • Scenario 9: Seven Tricky Sentences for NLP 24
  • Scenario 10: Data Scientists Dictate What We Eat? 25
  • Scenario 11: Increasing Sales with Better Relevancy 27
  • Scenario 12: Detecting Fake Profiles or Likes on Facebook 29
  • Scenario 13: Analytics for Restaurants 30

Data Science History, Pioneers, and Modern Trends 30

  • Statistics Will Experience a Renaissance 31
  • History and Pioneers 32
  • Modern Trends 34
  • Recent Q&A Discussions 35

Summary 39

Chapter 2 - Big Data Is Different 41

Two Big Data Issues 41

  • The Curse of Big Data 41
  • When Data Flows Too Fast 45

Examples of Big Data Techniques 51

  • Big Data Problem Epitomizing the Challenges of Data Science 51
  • Clustering and Taxonomy Creation for Massive Data Sets 53
  • Excel with 100 Million Rows 57

What MapReduce Can’t Do 60

  • The Problem 61
  • Three Solutions 61
  • Conclusion: When to Use MapReduce 63

Communication Issues 63

Data Science: The End of Statistics? 65

  • The Eight Worst Predictive Modeling Techniques 65
  • Marrying Computer Science, Statistics, and Domain Expertise 67

The Big Data Ecosystem 70

Summary 71

Chapter 3 - Becoming a Data Scientist 73

Key Features of Data Scientists 73

  • Data Scientist Roles 73
  • Horizontal Versus Vertical Data Scientist 75

Types of Data Scientists 78

  • Fake Data Scientist 78
  • Self-Made Data Scientist 78
  • Amateur Data Scientist 79
  • Extreme Data Scientist 80

Data Scientist Demographics 82

Training for Data Science 82

  • University Programs 82
  • Corporate and Association Training Programs 86
  • Free Training Programs 87

Data Scientist Career Paths 89

  • The Independent Consultant 89
  • The Entrepreneur 95

Summary 107

Chapter 4 - Data Science Craftsmanship, Part I 109

New Types of Metrics 110

  • Metrics to Optimize Digital Marketing Campaigns 111
  • Metrics for Fraud Detection 112

Choosing Proper Analytics Tools 113

  • Analytics Software 114
  • Visualization Tools 115
  • Real-Time Products 116
  • Programming Languages 117

Visualization 118

  • Producing Data Videos with R 118
  • More Sophisticated Videos 122

Statistical Modeling Without Models 122

  • What Is a Statistical Model Without Modeling? 123
  • How Does the Algorithm Work? 124
  • Source Code to Produce the Data Sets 125

Three Classes of Metrics: Centrality, Volatility, Bumpiness 125

  • Relationships Among Centrality, Volatility, and Bumpiness 125
  • Defining Bumpiness 126
  • Bumpiness Computation in Excel 127
  • Uses of Bumpiness Coefficients 128

Statistical Clustering for Big Data 129

Correlation and R-Squared for Big Data 130

  • A New Family of Rank Correlations 132
  • Asymptotic Distribution and Normalization 134

Computational Complexity 137

  • Computing q(n) 137
  • A Theoretical Solution 140

Structured Coefficient 140

Identifying the Number of Clusters 141

  • Methodology 142
  • Example 143

Internet Topology Mapping 143

Securing Communications: Data Encoding 147

Summary 149

Chapter 5 - Data Science Craftsmanship, Part II 151

Data Dictionary 152

  • What Is a Data Dictionary? 152
  • Building a Data Dictionary 152

Hidden Decision Trees 153

  • Implementation 155
  • Example: Scoring Internet Traffic 156
  • Conclusion 158

Model-Free Confidence Intervals 158

  • Methodology 158
  • The Analyticbridge First Theorem 159
  • Application 160
  • Source Code 160

Random Numbers 161

Four Ways to Solve a Problem 163

  • Intuitive Approach for Business Analysts with Great Intuitive Abilities 164
  • Monte Carlo Simulations Approach for Software Engineers 165
  • Statistical Modeling Approach for Statisticians 165
  • Big Data Approach for Computer Scientists 165

Causation Versus Correlation 165

How Do You Detect Causes? 166

Life Cycle of Data Science Projects 168

Predictive Modeling Mistakes 171

Logistic-Related Regressions 172

  • Interactions Between Variables 172
  • First Order Approximation 172
  • Second Order Approximation 174
  • Regression with Excel 175

Experimental Design 176

  • Interesting Metrics 176
  • Segmenting the Patient Population 176
  • Customized Treatments 177

Analytics as a Service and APIs 178

  • How It Works 179
  • Example of Implementation 179
  • Source Code for Keyword Correlation API 180

Miscellaneous Topics 183

  • Preserving Scores When Data Sets Change 183
  • Optimizing Web Crawlers 184
  • Hash Joins 186
  • Simple Source Code to Simulate Clusters 186

New Synthetic Variance for Hadoop and Big Data 187

  • Introduction to Hadoop/MapReduce 187
  • Synthetic Metrics 188
  • Hadoop, Numerical, and Statistical Stability 189
  • The Abstract Concept of Variance 189
  • A New Big Data Theorem 191
  • Transformation-Invariant Metrics 192
  • Implementation: Communications Versus Computational Costs 193
  • Final Comments 193

Summary 193

Chapter 6 - Data Science Application Case Studies 195

Stock Market 195

  • Pattern to Boost Return by 500 Percent 195
  • Optimizing Statistical Trading Strategies 197
  • Stock Trading API: Statistical Model 200
  • Stock Trading API: Implementation 202
  • Stock Market Simulations 203
  • Some Mathematics 205
  • New Trends 208

Encryption 209

  • Data Science Application: Steganography 209
  • Solid E‑Mail Encryption 212
  • Captcha Hack 214

Fraud Detection 216

  • Click Fraud 216
  • Continuous Click Scores Versus Binary Fraud/Non-Fraud 218
  • Mathematical Model and Benchmarking 219
  • Bias Due to Bogus Conversions 220
  • A Few Misconceptions 221
  • Statistical Challenges 221
  • Click Scoring to Optimize Keyword Bids 222
  • Automated, Fast Feature Selection with Combinatorial Optimization 224
  • Predictive Power of a Feature and Cross-Validation 225
  • Association Rules to Detect Collusion and Botnets 228
  • Extreme Value Theory for Pattern Detection 229

Digital Analytics 230

  • Online Advertising: Formula for Reach and Frequency 231
  • E‑Mail Marketing: Boosting Performance by 300 Percent 231
  • Optimize Keyword Advertising Campaigns in 7 Days 232
  • Automated News Feed Optimization 234
  • Competitive Intelligence with 234
  • Measuring Return on Twitter Hashtags 237
  • Improving Google Search with Three Fixes 240
  • Improving Relevancy Algorithms 242
  • Ad Rotation Problem 244

Miscellaneous 245

  • Better Sales Forecasts with Simpler Models 245
  • Better Detection of Healthcare Fraud 247
  • Attribution Modeling 248
  • Forecasting Meteorite Hits 248
  • Data Collection at Trailhead Parking Lots 252
  • Other Applications of Data Science 253

Summary 253

Chapter 7 - Launching Your New Data Science Career 255

Job Interview Questions 255

  • Questions About Your Experience 255
  • Technical Questions 257
  • General Questions 258
  • Questions About Data Science Projects 260

Testing Your Own Visual and Analytic Thinking 263

  • Detecting Patterns with the Naked Eye 263
  • Identifying Aberrations 266
  • Misleading Time Series and Random Walks 266

From Statistician to Data Scientist 268

  • Data Scientists Are Also Statistical Practitioners 268
  • Who Should Teach Statistics to Data Scientists? 269
  • Hiring Issues 269
  • Data Scientists Work Closely with Data Architects 270
  • Who Should Be Involved in Strategic Thinking? 270
  • Two Types of Statisticians 271
  • Using Big Data Versus Sampling 272

Taxonomy of a Data Scientist 273

  • Data Science’s Most Popular Skill Mixes 273
  • Top Data Scientists on LinkedIn 276

400 Data Scientist Job Titles 279

Salary Surveys 281

  • Salary Breakdown by Skill and Location 281
  • Create Your Own Salary Survey 285

Summary 285

Chapter 8 - Data Science Resources 287

Professional Resources 287

  • Data Sets 288
  • Books 288
  • Conferences and Organizations 290
  • Websites 291
  • Definitions 292

Career-Building Resources 295

  • Companies Employing Data Scientists 296
  • Sample Data Science Job Ads 297
  • Sample Resumes 297

Summary 298

Index 299

Follow me on Twitter at @ROIdoctor

Read more…

A Data Scientist's Journey

I describe here the projects that I worked on, as well as my career progress, starting 25 years ago as a PhD student in statistics, until today, and the transformation from statistician to data scientist that occurred slowly and started more than 20 years ago. This also illustrates many applications of data science, most of which are still active.

Early years

My interest in mathematics started when I was 7 or 8. I remember being fascinated by the powers of 2 in primary school, and later purchasing cheap Russian math books (Mir publisher) translated into French, for my entertainment. In high school, I participated in the mathematical olympiads, and did my own math research during math classes, rather than listening to the very boring lessons. When I attended college, I stopped showing up in the classroom altogether - after all, you could just read the syllabus, memorize the material before the exam and regurgitate it at the exam. Fast forward: I ended up with a PhD summa cum laude in (computational) statistics, followed by a joint postdoc in Cambridge (UK) and the National Institute of Statistical Sciences (North Carolina). Just after completing my PhD, I had to do my military service, where I learned old database programming (SQL on DB2) - this helped me get my first job in the corporate world in 1997 (in the US), where SQL was a requirement - and still is today for most data science positions.

My academia years (1988 - 1996)

My major was in Math/Stats at Namur University, and I was exposed between 1988 and 1997 to a number of interesting projects, most being precursors to data science:

  • Object oriented programming in Pascal and C++
  • Writing your own database software in Pascal (student project)
  • Simulation of pointers, calculator and recursion, in Fortran or Pascal
  • Creating my own image processing software (in C), reverse-engineering Windows bitmap formats and directly accessing memory with low-level code to load images 20 times faster than Windows
  • Working on image segmentation and image filtering (signal processing, noise removal, de-blurring), using mixture models / adaptive density estimators in high dimensions
  • Working with engineers on geographic information systems, and fractal image compression - a subject that I will discuss in my next book on automated data science. At the same time, working for a small R&D company, I designed models to predict extreme floods, using 3-D ground numerical models. The data was stored in a hierarchical database (digital images based on aerial pictures, the third dimension being elevation, and the ground being segmented into different categories - water, crop, urban, forest etc.) Each pixel represented a few square feet.
  • Extensive research on simulation (to generate high quality random numbers, or to simulate various stochastic processes that model complex cluster structures)
  • Oil industry: detecting oil field boundaries by minimizing the number of dry wells - known as the inside / outside problem, and based on convex domain estimation
  • Insurance: segmentation and scoring of clients (discriminant analysis)

At Cambridge University in 1995 (click here to see the names of all these statisticians)

When I moved to the Cambridge University stats lab and then NISS to complete my post-doc (under the supervision of Professor Richard Smith), I worked on:

  • Markov chain Monte Carlo modeling (Bayesian hierarchical models applied to complex cluster structures)
  • Spatio-temporal models
  • Environmental statistics: storm modeling, extreme value theory, and assessing leaks at the Hanford nuclear reservation (Washington State), using spatio-temporal models applied to chromium levels measured in 12 wells. The question was: is there a trend - increased leaking - and is it leaking into the Columbia river located a few hundred feet away?
  • Disaggregation of rainfall time series (purpose: improve irrigation, the way water is distributed during the day - agriculture project) 
  • I also wrote an interesting article on simulating multivariate distributions with known marginals

Note: AnalyticBridge's logo represents the mathematical bridge in Cambridge.

My first years in the corporate world (1996 - 2002)

I was first offered a job at MapQuest, to refine a system that helps car drivers with automated navigation. At that time, the location of the vehicle was not determined by GPS, but by checking the speed and changes in direction (measured in degrees, as the driver makes a turn). This technique was prone to errors and that's why they wanted to hire a statistician. But eventually, I decided to work for CNET instead, as they offered a full-time position rather than a consulting role.
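
Estimating position from speed and changes in direction is usually called dead reckoning. The sketch below is a generic, simplified version of the idea, not the system described above: the step format `(speed, turn_deg, dt)` and the function name are assumptions for illustration.

```python
import math

def dead_reckon(start_x, start_y, start_heading_deg, steps):
    """Integrate (speed, turn_deg, dt) steps into a position estimate.

    Heading is in degrees, counter-clockwise from the +x axis.
    Each step applies the turn first, then moves at constant speed for dt.
    """
    x, y, heading = start_x, start_y, start_heading_deg
    for speed, turn_deg, dt in steps:
        heading = (heading + turn_deg) % 360.0
        x += speed * dt * math.cos(math.radians(heading))
        y += speed * dt * math.sin(math.radians(heading))
    return x, y, heading

# 10 m/s along +x for 5 s, then a 90-degree left turn and 10 m/s for 3 s.
x, y, heading = dead_reckon(0.0, 0.0, 0.0, [(10.0, 0.0, 5.0), (10.0, 90.0, 3.0)])
```

Because every step builds on the previous estimate, any error in a measured speed or turn angle is carried into all later positions, which is one reason such a system benefits from statistical correction.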

I started in 1997 working for CNET, at that time a relatively small digital publisher (they eventually acquired ZDNet). My first project involved designing an alarm system to send automated email to channel managers whenever traffic numbers were too low or too high: a red flag indicated significant under-performance, a bold red flag indicated extreme under-performance. Managers could then trace the dips and spikes to events taking place on the platform, such as a double load of traffic numbers (making the numbers 2x as big as they should be), the web site being down for a couple of hours, a promotion etc. The alarm system used SAS to predict traffic (time series modeling, with seasonality, and confidence intervals for daily estimates); Perl/CGI to develop it as an API, access databases, and send automated email; Sybase (star schema) to access the traffic database and create a small database of predicted/estimated traffic (to match with real, observed traffic); and of course, cron jobs to run everything automatically, in batch mode, according to a pre-specified schedule - and resume automatically in case of a crash or other failure (e.g. when production of traffic statistics was delayed or needed to be fixed first, due to some glitch). This might be the first time that I created automated data science.
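
The flagging step of such an alarm system can be sketched as a comparison between observed traffic and a forecast with its uncertainty. The original used SAS forecasts and Perl; the Python version below only illustrates the logic, and the z-score thresholds (2 and 3) and the "spike" label are hypothetical choices, not the original settings.

```python
def traffic_alert(observed, predicted, std_error, warn_z=2.0, extreme_z=3.0):
    """Classify a daily traffic number against its forecast.

    warn_z and extreme_z are hypothetical thresholds expressed in
    standard errors of the forecast.
    """
    z = (observed - predicted) / std_error
    if z <= -extreme_z:
        return "bold red flag"   # extreme under-performance
    if z <= -warn_z:
        return "red flag"        # significant under-performance
    if z >= warn_z:
        return "spike"           # unusually high, e.g. double-loaded numbers
    return "ok"

# Made-up daily numbers for one channel: forecast 100,000 with std error 10,000.
statuses = [
    traffic_alert(obs, predicted=100_000, std_error=10_000)
    for obs in (101_000, 78_000, 70_000, 200_000)
]
```

In a batch setup, a scheduled job would run this check after each day's traffic load and email the channel manager for any status other than "ok".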

Later in 2000, I was involved with market research, business and competitive intelligence. My title was Sr. Statistician. Besides identifying, defining, and designing tracking (measurement) methods for KPI's, here are some of the interesting projects I worked on:

  • Ad inventory management, in collaboration with GE (they owned NBCi, the company I was working for, and NBCi was a spin-off of CNET, first called in 1999). We worked on better predicting the number of impressions available for advertisers, to optimize sales and reduce both over-booking and unsold inventory. I also came up with the reach and frequency formula (a much cleaner description can be found in my book page 231). Note that most people consider this to be a supply chain problem, which is a sub-domain of operations research. It nevertheless is very statistics-intensive and heavily based on data, especially when the inventory is digital and very well tracked.
  • Price elasticity study for, to determine optimum prices based on prices offered by competitors, number of competing products and other metrics. The statistical model was not linear, it involved variables such as minimum price offered by competitors, for each product and each day. I used a web crawler to extract the pricing information (displayed on the website) because the price database was terribly bad, with tons of missing prices and erroneous data.
  • Advertising mix optimization, using a long-term, macro-economic time series model, with monthly data from various sources (ad spend for various channels). I introduced a decay in the model, as TV ads seen six months ago still had an impact today, although smaller than recent ads. The model included up to the last 6 months worth of data.
  • Attribution modeling. How to detect, among a mix of 20 TV shows used each week to promote a website, which TV shows are most effective in driving sticky traffic to the website in question (NBCi). In short, the problem consists in attributing to each TV show, its share in traffic growth, to optimize the mix, and thus, the growth. It also includes looking at the lifetime value of a user (survival models) based on acquisition channel. You can find more details on this in my book page 248.
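
The decay idea in the advertising mix model above can be sketched as a weighted sum of recent spend, where older months count less. The geometric form and the decay rate of 0.5 below are assumptions for illustration; the post does not specify the actual decay function.

```python
def decayed_ad_effect(monthly_spend, decay=0.5, window=6):
    """Weighted sum of the last `window` months of ad spend.

    monthly_spend is ordered oldest to newest. The most recent month has
    weight 1, the month before it `decay`, then decay**2, and so on.
    The decay rate is a hypothetical value chosen for illustration.
    """
    recent = monthly_spend[-window:]
    return sum(
        spend * decay ** age
        for age, spend in enumerate(reversed(recent))
    )

# Made-up spend, oldest to newest; the first month falls outside the window.
spend = [100.0, 40.0, 0.0, 0.0, 0.0, 0.0, 80.0]
effect = decayed_ad_effect(spend)
```

Here the 80 spent this month counts in full, the 40 spent five months earlier contributes 40 × 0.5⁵ = 1.25, and the oldest month is dropped because it is outside the six-month window.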

Consulting years (2002 - today)

I worked for various companies - Visa, Wells Fargo, InfoSpace, Looksmart, Microsoft, eBay, sometimes even as a regular employee, but mostly in a consulting capacity. It started with Visa in 2002, after a short stint with a statistical litigation company (William Wecker Associates), where I improved time-to-crime models that were biased because of right-censoring in the data (future crimes attached to a gun are not seen yet - this was an analysis in connection with the gun manufacturers lawsuit).

At Visa, I developed multivariate features for credit card fraud detection in real time, especially single-ping fraud, working on data sets with 50 million transactions - too big for SAS to handle at that time (a SAS sort would crash), and that's when I first developed Hadoop-like systems (nowadays, a SAS sort can very easily handle 50 million rows without visible Map-Reduce technology). Most importantly, I used Perl, associative arrays and hash tables to process hundreds of feature combinations (to detect the best one based on some lift metric) while SAS would - at that time - process one feature combination over the whole weekend. Hash tables were used to store millions of bins, so an important part of the project was data binning - doing it right (too many bins result in a need for intensive Hadoop-like programming, too few result in lack of accuracy or predictability). That's when I came up with the concepts of hidden decision trees, predictive power of a feature, and testing a large number of feature combinations simultaneously. This is much better explained in my book pages 225-228 and pages 153-158.
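
The hash-table approach described above can be sketched in a few lines: each feature combination defines a key, rows are counted into bins stored in a dictionary, and each combination is scored by a lift metric. The original was written in Perl; this Python version is only an illustration. The scoring rule (highest bin lift among bins with at least `min_rows` rows), the column names and the sample data are all assumptions, since the post does not specify the actual metric.

```python
from collections import defaultdict
from itertools import combinations

def best_feature_combination(rows, features, label="fraud", size=2, min_rows=2):
    """Score every `size`-feature combination by its best bin lift.

    rows: list of dicts with already-binned feature values and a 0/1 label.
    Lift of a bin = fraud rate in the bin / overall fraud rate.
    """
    overall = sum(r[label] for r in rows) / len(rows)
    if overall == 0:
        return None, 0.0
    best_combo, best_lift = None, 0.0
    for combo in combinations(features, size):
        bins = defaultdict(lambda: [0, 0])   # key -> [row count, fraud count]
        for r in rows:
            key = tuple(r[f] for f in combo)
            bins[key][0] += 1
            bins[key][1] += r[label]
        lift = max(
            ((fraud / n) / overall for n, fraud in bins.values() if n >= min_rows),
            default=0.0,
        )
        if lift > best_lift:
            best_combo, best_lift = combo, lift
    return best_combo, best_lift

# Made-up, pre-binned transactions.
columns = ("country", "hour", "amount", "fraud")
data = [
    ("US", "day", "low", 0), ("US", "day", "high", 0),
    ("US", "night", "low", 0), ("US", "night", "high", 0),
    ("XX", "day", "low", 0), ("XX", "day", "high", 0),
    ("XX", "night", "high", 1), ("XX", "night", "low", 1),
]
rows = [dict(zip(columns, rec)) for rec in data]
combo, lift = best_feature_combination(rows, ["country", "hour", "amount"])
```

One pass over the data per combination is enough, and the number of dictionary keys is the number of non-empty bins, which is why the granularity of the binning drives memory use.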

After Visa, I worked at Wells Fargo, and my main contribution was to find that all our analyses were based on wrong data. It had been wrong for a long time without anyone noticing, well before I joined this project: Tealeaf sessions spanning across multiple servers were broken into small sessions (we discovered it by simulating our own sessions and looking at what showed up in the log files the next day), making it impossible to really track user activity. Of course we fixed the problem. The purpose here was to make user navigation easier, and identify when a user is ready for cross-selling, and which products should be presented to him/her based on history.

So I moved away from the Internet, to Finance and fraud detection. But I came back to the Internet around 2005, this time to focus on traffic quality, click fraud, taxonomy creation, and optimizing bids on Google keywords - projects that require text mining and NLP (natural language processing) expertise. My most recent consulting years involved the following projects:

  • Microsoft: time series analysis (with correlograms) to detect and assess intensity of change points or new trends in KPIs, and match them with events (such as redefining a user, which impacts user count). Blind analysis, in the sense that I was told about the events AFTER I detected the change points.
  • Adaptive A/B testing, where sample sizes are updated every day, to increase sample of best performer (and decrease sample of worst performer) as soon as a slight but statistically significant trend is discovered, usually halfway through the test.
  • eBay: automated bidding for millions of keywords (old and new - most have no historical data) uploaded daily on Google AdWords. I also designed keyword scoring systems that predict performance in the absence of historical data by looking at metrics such as keyword length, number of tokens, presence of digits or special characters or special tokens such as 'new', keyword rarity, keyword category and related keywords.
  • Click fraud and Botnet detection: creation of a new scoring system that uses IP flagging (by vendors such as Spamhaus and Barracuda) rather than unreliable conversion metrics, to predict quality and re-route Internet traffic. Ad matching (relevancy) algorithms (see my book pages 242-244). Creation of the Internet topology mapping (see my book pages 143-147) to cluster IP addresses and better catch fraud, using advanced metrics based on these clusters.
  • Taxonomy creation: for instance I identified that the restaurant category, in the Yellow Pages, breaks down not only by type of cuisine, but also by type of restaurant (city, view, riverfront, mountain, pub, wine bar), atmosphere (elegant, romantic, family, fast food) and stuff that you don't eat (menus, recipes, restaurant jobs and furniture). This analysis was based on millions of search queries from the Yellow Pages and crawling and parsing massive amounts of text (in particular the whole DMOZ directory including crawling 1 million websites), to identify keyword correlations. Click here for details, or read my book, pages 53-57.
  • Creation of the list of top 100,000 commercial keywords and categorization by user intent. Extensive use of Bing API and Google's paid keyword API to extract statistics for millions of keywords (value and volume depending on match type), as well as to detect new keywords. Click here for a related search intelligence problem.
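
The keyword scoring idea from the eBay project can be illustrated with a small feature extractor. This is only a sketch in Python: the function name, the feature names, and the `SPECIAL_TOKENS` set are hypothetical, and it covers only the metrics that can be computed from the keyword string itself (length, number of tokens, digits, special characters, special tokens). Keyword rarity, category and related keywords would need external data and are left out.

```python
import re

# Hypothetical list of special tokens; the text only gives 'new' as an example.
SPECIAL_TOKENS = {"new"}


def keyword_features(keyword):
    """Compute simple string-based features for a search keyword."""
    tokens = keyword.lower().split()
    return {
        "length": len(keyword),
        "num_tokens": len(tokens),
        "has_digit": any(ch.isdigit() for ch in keyword),
        # Any character that is neither a word character nor whitespace.
        "has_special_char": bool(re.search(r"[^\w\s]", keyword)),
        "has_special_token": any(tok in SPECIAL_TOKENS for tok in tokens),
    }
```

Features like these are available for every keyword, including new ones with no click or conversion history, which is why they could feed a model that predicts performance before any data has been collected.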

During these years, I also created my first start-up to score Internet traffic (raising $6 million in funding) and produced a few patents.


As the co-founder of DataScienceCentral, I am also the data scientist on board, optimizing our email blasts and traffic growth with a mix of paid and organic traffic as well as various data science hacks. I also optimize client campaigns and manage a system of automated feeds for automated content production (see my book page 234). But the most visible part of my activity consists of:

  • The data science apprenticeship and certification
  • The data science research lab, publishing original articles (many are included in my book) such as jackknife regression, taxonomy creation, model-free confidence intervals, random number simulator, steganography, data compression, the curse of big data, Map-Reduce articles: click here for details.
  • The preprint service for academics (coming soon)
  • The data science cheat sheet (in progress)
  • Book on automated data science (coming soon)
  • Organizing data science competitions

I am also involved in designing APIs and AaaS (Analytics as a Service). I actually wrote my first API in 2002 to sell stock trading signals: read my book pages 195-208 for details. I was even offered a job at Edison (utility company in Los Angeles) to trade electricity on their behalf. And I also worked on other arbitraging systems, in particular click arbitraging.


Grew revenue and profits from 5 to 7 digits in less than two years, while maintaining profit margins above 65%. Grew traffic and membership by 300% in two years. Introduced highly innovative, efficient, and scalable advertising products for our clients. DataScienceCentral is an entirely self-funded, lean startup with no debt, and no payroll (our biggest expense on gross revenue is taxes). I used state-of-the-art growth science techniques to outperform competition.

Publications, Conferences


Refereed Publications

  • Granville V., Rasson J.P. A strange recursive relation. Journal of Number Theory 30 (1988), 238-241.
  • Granville V., Engelen M. Un modele de prevision des debits pour la Semois. La Tribune de l’Eau 42 (1989), 41-48.
  • Granville V., Rasson J.P. A new type of random permutation generator to simulate random images. Computational Statistics Quarterly 6 (1990), 55-64.
  • Granville V., Krivanek M., Rasson J.P. Remarks on computational complexity of hyper-volume classification method. Computational Statistics Quarterly 6 (1991), 315-319.
  • Granville V. An introduction to random number theory. Bulletin of the Belgian Mathematical Society, Series A, 43 (1991), 431-440.
  • Granville V. Rainbow, a software for raster image processing. Computational Statistics 7 (1992), software section, 209-212.
  • Granville V., Rasson J.P. Density estimation on a finite regular lattice. Computational Statistics 7 (1992), 129-136.
  • Granville V., Rasson J.P. A new modeling of noise in image remote sensing. Statistics and Probability Letters 14 (1992), 61-65.
  • Granville V., Krivanek M., Rasson J.P. Clustering, classification and image segmentation on the grid. Computational Statistics and Data Analysis 15 (1993), 199-209.
  • Granville V., Rasson J.P. A Bayesian filter for a mixture model of noise in image remote sensing. Computational Statistics and Data Analysis 15 (1993), 297-308.
  • Granville V., Rasson J.P. Markov random field models in image remote sensing. In: Computer Intensive Methods in Statistics. Hardle W. and Simar L. Ed., Physica-Verlag, Heidelberg, 1993, 98-112.
  • Granville V., Schifflers E. Efficient algorithms for exact inference in 2 x 2 contingency tables. Statistics and Computing 3 (1993), 83-87.
  • Granville V., Rasson J.P., Orban-Ferauge F. Un modele Bayesien de segmentation d’images. In: Teledetection appliquee a la cartographie thematique et topographique. Presses de l’Universite du Quebec, Sainte-Foy (Canada), 1993, 305-310.
  • Granville V., Krivanek M., Rasson J.P. Simulated annealing: a proof of convergence. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1994), 652-656.
  • Granville V., Rasson J.P. Bayesian filtering and supervised classification in image remote sensing. Computational Statistics and Data Analysis 20 (1995), 203-225.
  • Granville V., Rasson J.P. Multivariate discriminant analysis and maximum penalized likelihood density estimation. Journal of the Royal Statistical Society, Series B, 57 (1995), 501-517.
  • Granville V. Discriminant analysis and density estimation on the finite d-dimensional grid. Computational Statistics and Data Analysis 22 (1996), 27-51.
  • Rasson J.P., Granville V. Geometrical tools in classification. Computational Statistics and Data Analysis 23 (1996), 105-123.
  • Granville V. Estimation of the intensity of a Poisson point process by means of nearest neighbor distances. Statistica Neerlandica 52 (1998), 112-124. Also available as Tech. Rep. 96-15, Statistical Laboratory, University of Cambridge.

Other Selected Publications

  • Granville V., Smith R.L. Clustering and Neyman-Scott process parameter simulation via Gibbs sampling. Tech. Rep. 95-19, Statistical Laboratory, University of Cambridge (1995, 20 pages).
  • Granville V. Sampling from a bivariate distribution with known marginals. Tech. Rep. 47, National Institute of Statistical Sciences, Research Triangle Park, North Carolina (1996).
  • Granville V., Smith R.L. Disaggregation of rainfall time series via Gibbs sampling. Preprint, Statistics Department, University of North Carolina at Chapel Hill (1996, 20 pages).
  • Granville V., Rasson J.P., Orban-Ferauge F. From a natural to a behavioral classification rule. In: Earth observation, Belgian Scientific Space Research (Ed.), 1994, pp. 127-145.
  • Granville V. Bayesian filtering and supervised classification in image remote sensing. Ph.D. Thesis, 1993.
  • Granville V. K-processus et application au debruitage d’images. Master Thesis, 1988.
  • Granville V. Data Science Cheat Sheet (in progress)

Conference and Seminars

  • COMPSTAT 90, Dubrovnik, Yugoslavia, 1990. Communication: A new modeling of noise in image remote sensing. Funded by the Communaute Francaise de Belgique and FNRS.
  • 3rd Journees Scientifiques du Reseau de Teledetection de l’UREF, Toulouse, France, 1990. Poster: Une approche non gaussienne du bruit en traitement d’images. Funded.
  • 4th Journees Scientifiques du Reseau de Teledetection de l’UREF, Montreal, Canada, 1991. Poster: Un modele Bayesien de segmentation d’images. Funded by AUPELF-UREF.
  • 12th Franco-Belgian Meeting of Statisticians, Louvain, Belgium, 1991. Invited communication: Markov random field models in image remote sensing.
  • 24th Journees de Statistiques, Bruxelles, Belgium, 1992. Software presentation: Rainbow, un logiciel graphique dedicace.
  • 8th International Workshop on Statistical Modeling, Leuven, Belgium, 1993. Poster: Discriminant analysis and density estimation on the finite d-dimensional grid.
  • 14th Franco-Belgian Meeting of Statisticians, Namur, Belgium, 1993. Invited communication: Mesures d’intensite et outils geometriques en analyse discriminante.
  • Center for Wiskunde en Informatica, Amsterdam, Netherlands, 1993 (one week stay). Invited seminar: Bayesian filtering and supervised classification in image remote sensing. Invited by Professor Adrian Baddeley and funded by CWI.
  • Annual Meeting of the Belgian Statistical Society, Spa, Belgium, 1994. Invited communication: Maximum penalized likelihood density estimation.
  • Statistical Laboratory, University of Cambridge, 1994. Invited seminar: Discriminant analysis and filtering: applications in satellite imagery. Invited by Professor R.L. Smith. Funded by the University of Cambridge.
  • Invited seminars on Clustering, Bayesian statistics, spatial processes and MCMC simulation:
    • Biomathematics and Statistics Scotland (BioSS), Aberdeen, 1995. Invited by Rob Kempton and funded by BioSS.
    • Department of Mathematics, Imperial College, London, 1995. Invited by Professor A. Walden.
    • Statistical Laboratory, Iowa State University, Ames, Iowa, 1996. Invited by Professor Noel Cressie and funded by Iowa State University.
    • Department of Statistics and Actuarial Sciences, University of Iowa, Iowa City, Iowa, 1996. Invited by Professor Dale Zimmerman.
    • National Institute of Statistical Sciences (NISS), Research Triangle Park, North Carolina, 1996. Invited by Professor J. Sacks and Professor A.F. Karr. Funded by NISS.
  • Scientific Meeting FNRS, Brussels, 1995. Invited communication: Markov Chain Monte Carlo methods in clustering. Funded.
  • 3rd European Seminar of Statistics, Toulouse, France, 1996. Invited. Funded by the EC.
  • 1st European Conference on Highly Structured Stochastic Systems, Rebild, Denmark, 1996. Contributed paper: Clustering and Neyman-Scott process parameter estimation via Gibbs sampling. Funded by the EC.
  • National Institute of Statistical Sciences (NISS), Research Triangle Park, North Carolina, 1996. Seminar: Statistical analysis of chemical deposits at the Hanford site.
  • Institute of Statistics and Decision Sciences, Duke University, Durham, North Carolina, 1997. Invited seminar: Stochastic models for hourly rainfall time series: fitting and statistical analysis based on Markov Chain Monte Carlo methods. Invited by Professor Mike West.
  • Southern California Edison, Los Angeles, 2003. Seminar: Efficient risk management for hedge fund managers. Invited and funded.
  • InfoSpace, Seattle, 2004. Seminar: Click validation, click fraud detection and click scoring for pay-per-click search engines. Invited and Funded.
  • AdTech, San Francisco, 2006. Click fraud panel.
  • AMSTAT JSM, Seattle, 2006. Talk: How to address click fraud in Pay-per-Click programs.
  • Predictive Analytics World, San Francisco, 2008. Talk: Predictive keyword scores to optimize pay-per-click campaigns. Invited and Funded.
  • M2009 – SAS Data Mining Conference, Las Vegas, 2009. Talk: Hidden decision trees to design predictive scores – application to .... Invited and Funded. 
  • Text Analytics Summit, San Jose, 2011. Detection of Spam, Unwelcomed Postings, and Commercial Abuses in So.... Invited. 

Follow me on Twitter at @ROIdoctor
