This was one of the questions asked by a colleague of mine who is new to the data analytics and Machine learning space. I had to tell him that this isn’t possible because data itself is not suitable to do any sort of valuable predictions. It usually takes some time to get this intuition. Those who have statistics background can quickly catch-up, others have to put up extra effort to solidify these concepts.

So, back to the question, can we build a model to predict lottery numbers ?
The short answer is NO, and the long answer is NO :). Now you have to ask why !

Why this post ?

This question may be a very trivial one for seasoned ML practitioners, but
sometimes experienced one also misses the key basic concepts when it comes to
how to deal with data and what data is good for a particular problem.
This is my attempt to answer the questions that I had !

Looking into the winning chances

Let’s take one lottery series for this experiment, this lottery number has 2
alphabet and 6 numbers, eg; AG389435. In this case how many unique lottery
tickets can be printed ?

$$ \begin{array}{l} Total\ number\ of\ lotteries\ for\ number\ sequence\ of\ form\ \mathbf{AB123456}\\ \\ =\ 26^{2} \ *\ 10\ ^{6} \ \\ =\ 676\ Million\\ \\ =\ Chance\ of\ winning\ for\ a\ person\ =\ \frac{1}{676\ Million} \ =\ \mathbf{1.47*10^{-9}}\\ \\ Here\ we\ are\ assuming\ equal\ chance\ for\ all\ to\ win,\ and\ the\ lottery\ system\ will\\ ensure\ this\ constrain.\\ \\ So\ winning\ probability/chance\ is\ \mathbf{0.000000147} \ \ or\ 1\ in\ 676\ Million \end{array}$$

To put this number into context, let's take another study done to identify the
cancer risk due to smoking [[link]] 2. According to their study they estimate 1 in 10
men and 1 in 8 women in India can expect to develop cancer of any form, in their
life span after the age of 35 year. This means chance of getting cancer is way higher
than winning a lottery.

Why am I saying lottery data isn't usable for any prediction

Because the lottery prediction is done using some random number generator (RNG),
this means RNG ensures that every lottery purchased gets equal probability of
winning, usually the odds of winning is above 1 in Million. And there won't be
any relation between historical predicted number and the future numbers, because
they are close to independent. So you can't find any relation from the
historical winning numbers.

If we are using some faulty RNG machine or RNG system with lesser entropy then
the results may be skewed, so we can see patterns in number prediction.
So how few individuals broke the lottery system legally, below listed are the
two instances people worked hard to increase their chances of winning, here also
they aren't predicting the actual winning numbers, instead increasing the chance
of winning probability of each lottery.

  1. Michle Memorized live game patterns

2. Or you have to run a lottery syndicate to increase the chance of winning by
  pooling the tickets from multiple contributors. Here is an interesting guy
  who did this thing across the world. It's purely playing against the odds by
  purchasing more tickets.

Does an ML model won’t predict the winning numbers at all ?

If you run enough times, your model may predict right answers some times, so why
are you saying ML models can’t predict correctly at all ? We call a model doing
good, when it produces predictions better than simple guessing. If your model
prediction accuracy is close to any other simple guess work, then why do you
need this ML model after all.

The predictions are really based on known patterns. The patterns are coming out
of data ( data may be image, sound, simple numbers any thing).

When can we do ML

We can do any sort of data analysis or prediction using statistical methods only
when the data has some repeating patterns or correlation or some inherent
orders.

eg; When you see a different type of cat, you can identify it as cat, how’s that
working, when you look at it objectively there are some underlying order in
their pixels and behaviours, and our brain maps those patterns of light signals
to cat. And you can apply this same analogy ( similar technique ) to our ML
models here, models learn to identify cat / dog by capturing these common
patterns present in the data or image or video. If the data doesn’t bring these
patterns, how can any system or our brain identify a particular object ?

Similarities with Language Models

Number of sequences in the lottery number can be treated as a finite sequence.
The random numbers from RNG usually pick random numbers from this finite
sequence. Let's connect this to the Language models in NLP. NLP language models
predict the next word from whatever it has seen till now. If we take all the
words in english dictionary, it comes around 171,476 words. If we arrange these
words in particular order it forms a valid sentence, ie; we can't put all
combinations of words to form a valid sentence. This means there is a fixed
order of word sequence, this is what the NLP models learn internally via word
co-occurrence and other methods.

Now come back and see how the sequence predicted by RNG for the lottery has some
pattern ?, it shouldn't. So we can't find any sort of significant statistical
correlation between two lottery numbers.

How click streams are helping for recommendation

This is another way of pulling out data that has some meaning or patterns
present in it. When you browse over any e-commerce site, for a given query your
mouse movement leaves out some meaning about those items. They are related items
!. So if you pull out the clickstream done by a user for a session we can find
the related items and using that data we can do recommendations for a given
query.

You can read more about this from this paper from Airbnb

NOTE: The original version was posted in my personal blog.