This course is for you if you want to fully equip yourself with the art of applied machine learning using MATLAB. We will apply the most commonly used data preprocessing techniques without having to learn all the complicated maths. Additionally, this course is for you if you have spent hours and hours on machine learning implementation but could never figure out how to further improve the performance of your machine learning algorithms. By the end of this course, you will have at your fingertips a wide variety of the most commonly used data preprocessing techniques that you can use instantly to maximize your insight into your data set.
This course is for you if you want to have a real feel of the Machine Learning techniques without having to learn all of the complicated math. Additionally, this course is also for you if you have had previous hours of machine learning theory but could never figure out how to implement and solve data science problems with it.
The approach in this course is very practical and we will begin with the basics. We will start coding immediately after a couple of introductory tutorials, and we try to keep the theory to a bare minimum. All of the coding will be done in MATLAB, which is one of the fundamental programming languages for engineering and science students and is frequently used by top data science research groups worldwide.
Below is a complete list of topics covered in the video. You will find the timestamps on YouTube.
In this class we’re going to talk about the multivariate Gaussian. So, in all our past lectures we’ve looked at the one-dimensional case, so one-dimensional distributions, both discrete and continuous. When we talk about the multivariate Gaussian distribution we’re talking about two or more dimensions, and of course MATLAB is perfect for this because it works natively with matrices and vectors. So now what happens when you extend the Gaussian like this? The first thing is that the mean becomes a vector, and again this doesn’t really change the definition of the mean. You still add up all the values of X and divide by n to get your sample mean. So as an example let’s say we have a random matrix with 100 rows, one per sample, and 10 columns, one per dimension. Alright, so, if we do mean(X) we get a 1 by 10 matrix, and each component is the 100 values in that column summed up and divided by 100. The thing that is more different from the one-dimensional case is the variance. So we see this Sigma symbol here, and what this stands for is the covariance matrix. The variance told us how far X was spread out from the mean; the covariance matrix does that too, but only on the diagonal. On the off-diagonals it tells us how correlated one dimension is with another, and so you can see here the definition of the covariance matrix: the covariance of the i-th dimension with the j-th dimension is the (i, j) element of the covariance matrix. We don’t really need to worry about how to calculate it because in the end we are going to use the MATLAB function cov. So let’s check that the covariance is 10 by 10, since our dimensionality is 10. What’s interesting about the PDF of the Gaussian in the multi-dimensional case is that we never use the covariance directly. We use its inverse, right here in the exponent, and its determinant, and again, since we already wrote a function to calculate Gaussians before, I’m not going to ask you to do it again.
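To make the shapes concrete, here is a minimal sketch of the mean and covariance check described above. The data matrix is an assumption (standard normal draws via randn), not the exact matrix used in the lecture.

```matlab
% Minimal sketch: 100 samples (rows) in 10 dimensions (columns).
% X is assumed example data, not the matrix from the lecture.
X = randn(100, 10);

mu = mean(X);        % column-wise sample mean, 1-by-10
Sigma = cov(X);      % sample covariance matrix, 10-by-10

disp(size(mu));      % size of the mean vector
disp(size(Sigma));   % size of the covariance matrix

% The diagonal of the covariance matrix holds the per-dimension variances,
% so it should match var(X) up to rounding error.
maxDiff = max(abs(diag(Sigma)' - var(X)));
fprintf('largest gap between diag(cov(X)) and var(X): %g\n', maxDiff);
```

The last two lines illustrate the point that the covariance matrix plays the role of the variance only on its diagonal.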
There’s already a function called mvnpdf in MATLAB that will do this for you. So I’ve got an example here that I want to show you. We set our mean and our covariance… Don’t worry about this code too much since it is not the focus of this lecture; we’re simply setting up the axes, so the X1 and X2 values. Note that the dimensionality of X is 2, so mu is two-dimensional and Sigma is two by two. So here we calculate the PDF values using mvnpdf, and we’re going to do a surface plot of F versus X1 and X2. One point worth talking about is that this distribution is more spread out along the X2 axis than the X1 axis, and that’s because, as you see here, it has a variance of 1 on the X2 axis but a variance of 0.25 on the X1 axis. And since the off-diagonals are not zero, what we get is a Gaussian whose contours are tilted rather than aligned with the X1 and X2 axes.
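A sketch of what such an mvnpdf example might look like follows. The variances 0.25 and 1 come from the lecture; the zero mean, the off-diagonal value 0.3, and the grid range are assumptions chosen for illustration. mvnpdf requires the Statistics and Machine Learning Toolbox.

```matlab
% Two-dimensional Gaussian evaluated on a grid and shown as a surface.
mu = [0 0];                      % assumed mean (not stated in the lecture)
Sigma = [0.25 0.3; 0.3 1];       % variances from the lecture, off-diagonal 0.3 assumed

% Grid of (x1, x2) points; the range -3..3 is an assumption.
[X1, X2] = meshgrid(linspace(-3, 3, 61), linspace(-3, 3, 61));

% mvnpdf expects one point per row, so flatten the grid and reshape afterwards.
F = mvnpdf([X1(:) X2(:)], mu, Sigma);
F = reshape(F, size(X1));

surf(X1, X2, F);
xlabel('x1'); ylabel('x2'); zlabel('pdf');
```

Setting the off-diagonal entries of Sigma to 0 in this sketch should give a surface whose contours line up with the axes, which is a quick way to see what the correlation term does.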
In this lecture we’re going to continue our talk about hypothesis testing of Gaussian distributed data. So to elaborate on how hypothesis tests work we’re going to talk about the hypotheses a little bit more. There are two hypotheses when we’re doing a statistical test. There’s a null hypothesis, which we call H-naught, and it usually represents the control data or random data that results purely from chance. The second hypothesis is the alternative hypothesis. This is the thing we’re trying to prove, generally. For example, if we’re doing an experiment where we’re testing a drug, the drug actually working would be the alternative hypothesis. And so the alternative hypothesis is the hypothesis that sample observations are influenced by some non-random cause. So now suppose we have some random data, so let’s say R1 equals randn. Let’s say it has a hundred points, a mean of zero, and a variance of one. Now let’s say we have another data set which is maybe 20 points, but this one’s going to have a different mean. So let’s say its mean is one, and let’s increase its spread a little bit. Okay, so, we have two data sets that are Gaussian distributed. The first one has a mean of zero and a variance of one, and the second one has a mean of one and a standard deviation of two. So, how do we compare these two distributions? There is a test called the two-sample t-test, ttest2 in MATLAB, which does what we want, and it again returns a hypothesis and a p-value. So we can try this: ttest2(R1,R2). Ok, so, we reject the null hypothesis in this case with a very, very small p-value. Remember it only has to be less than five percent for us to reject the null hypothesis. So let’s try some things. Let’s have fewer data points for R1. Okay, and let’s do our ttest again. So notice how we still reject the null hypothesis but our p-value has increased, so it’s less significant than before.
Alright, so, now let’s do the same thing for R2; let’s say this now only has 10 points. Let’s do the ttest again. Alright, so, this is still significant. Let’s increase the variance. Alright, so, I had to increase the variance a lot to get an insignificant p-value. So that’s one thing about comparing two Gaussian distributions: you can’t really say one is bigger than the other if they’re spread out a lot. So let’s put the variance back for R2, but let’s say its mean is now closer to R1’s mean. Let’s do our ttest again, and so this also gives us an insignificant p-value. So that’s another fact about the t-test: you also can’t tell that two distributions are different if their means are very close together. If they’re very far apart and we do the ttest again, we now get a very small p-value.
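The experiment above can be sketched roughly as follows. The sample sizes, means, and spreads for R1 and R2 follow the lecture; the fixed seed and the third data set R3 with a mean of 10 are assumptions added for illustration. ttest2 requires the Statistics and Machine Learning Toolbox.

```matlab
rng(0);                        % fixed seed, assumed only for repeatability

R1 = randn(100, 1);            % 100 points, mean 0, variance 1
R2 = 1 + 2 * randn(20, 1);     % 20 points, mean 1, standard deviation 2

% h = 1 means we reject the null hypothesis of equal means at the 5% level.
[h, p] = ttest2(R1, R2);
fprintf('R1 vs R2: h = %d, p = %.4g\n', h, p);

% A data set whose mean is very far from R1's mean (hypothetical example).
R3 = 10 + randn(20, 1);
[h3, p3] = ttest2(R1, R3);
fprintf('R1 vs R3: h = %d, p = %.4g\n', h3, p3);
```

Note that the two data sets here have different variances, while ttest2 assumes equal variances unless told otherwise. Passing 'Vartype','unequal' is one way to relax that assumption.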
In this lecture we’re going to talk more about the Gaussian distribution. So the first thing we’ll ask is: we’ve generated data so far from uniform distributions, so how can we generate data from a normal distribution? There’s a function in MATLAB called randn that will do this. If you type in randn it should show you its different function signatures. So let’s say we want to generate a thousand random normal values. Okay, and so this can represent any set of random values. If you download data from Kaggle, or some other data set that you want to do machine learning or statistics on, you want to know the distribution of your input. In this case, we want to know if the distribution is normal or not. So one thing we could do to check that is just to plot it, plot the histogram. Alright, so, you can see that it looks relatively normal. We can use more bins to get a finer granularity. So we still see this same type of distribution even with 50 bins, and of course if we increase n our curve looks closer and closer to the Gaussian curve that we saw in the last lecture. Now of course in the real world your data is not going to look this nice and clean, and your distribution may or may not be Gaussian. The question now is how do you test if your data is Gaussian distributed or not? So we’re going to talk a little bit about hypothesis testing, but not too much since it’s a pretty long and difficult subject. The basic idea with hypothesis testing is that you have two different hypotheses. In this lecture we’re not going to talk about the mechanics behind statistical testing; we’re going to kind of work backwards and jump right into statistical tests that you can use in MATLAB right away. So the first is the Jarque-Bera test. In MATLAB it’s the function jbtest. You can use it to return a hypothesis and a p-value.
The hypothesis result tells you whether or not to reject the claim that your data is normally distributed, and the p-value tells you the strength of the evidence. So let’s try this on the data that we just generated. Alright, so, one thing that we need to talk about is how to interpret the return values of the Jarque-Bera test. When we talk about hypothesis testing you’ll see that there is a null hypothesis and an alternative hypothesis. The Jarque-Bera test will return one if we reject the null hypothesis, and the null hypothesis is that our data is normally distributed. So if we pass in R, we get h equals 0 and a p-value of 0.5, which means we do not reject the null hypothesis that our data is normally distributed. There is another test called the Kolmogorov-Smirnov test, kstest in MATLAB, which essentially does the same thing. The null hypothesis again is that the data comes from a normal distribution, and the result h will be 1 if you reject the null hypothesis, so let’s try that. Same thing with kstest, right, and so this also does not reject the null hypothesis that the data is normally distributed. So now let’s do something interesting: let’s generate some random data that we know is not normally distributed. We’ll draw this data from a uniform distribution. So now I want to do jbtest on the uniform data. We get h equals 1, which means we do reject the null hypothesis, and our p-value is one one-thousandth; usually a p-value is considered significant when it’s less than five percent. So what this is basically saying is that our uniformly generated data is not Gaussian distributed, which we already know is true. We can do the same test with Kolmogorov-Smirnov, and this also rejects the null hypothesis, with a much smaller p-value. Alright, and so this is how you can test if your data is normally distributed or not.
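Here is a rough sketch of the normality checks walked through above. The sample size of 1000 follows the lecture; the seed and variable names are assumptions. jbtest and kstest require the Statistics and Machine Learning Toolbox.

```matlab
rng(1);                 % fixed seed, assumed only for repeatability

R = randn(1000, 1);     % data we know is normally distributed
U = rand(1000, 1);      % data we know is uniform, so not normal

histogram(R, 50);       % visual check with 50 bins

% For both tests, h = 1 means "reject the null hypothesis of normality".
[hJBn, pJBn] = jbtest(R);
[hKSn, pKSn] = kstest(R);   % kstest compares against the standard normal by default
[hJBu, pJBu] = jbtest(U);
[hKSu, pKSu] = kstest(U);

fprintf('normal data : jbtest h=%d p=%.3g, kstest h=%d p=%.3g\n', hJBn, pJBn, hKSn, pKSn);
fprintf('uniform data: jbtest h=%d p=%.3g, kstest h=%d p=%.3g\n', hJBu, pJBu, hKSu, pKSu);
```

As far as I know, jbtest looks its p-values up in a table that by default only covers 0.001 to 0.50 and warns when the true value falls outside that range. That would explain why the lecture sees exactly 0.5 for the normal data and one one-thousandth for the uniform data.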
In this lecture we’re going to talk about a special continuous distribution called the normal distribution, or the Gaussian distribution. It probably looks very familiar to you since it is what most people refer to as the bell curve, and you’ve probably seen this in school where bell curves are used to shift marks up or down based on how well students perform. So this formula you see here is the PDF of the Gaussian distribution; notice how they also use the little f notation on Wikipedia. Last time we talked about how the mean and the variance are two special numbers that help us describe what a continuous distribution looks like. The interesting thing about the Gaussian distribution is that the mean and the variance completely describe the shape of the distribution. The mean tells us where the center peak of the bell curve is, and the variance tells us how much that bell curve is spread out. So you can see this yellow curve is very spread out and the blue curve is not spread out that much. So let’s talk a little bit about this formula. First, there is a normalizing constant. It’s 1 over Sigma times the square root of 2 pi, where Sigma stands for the standard deviation. Usually it’s written as 1 over the square root of 2 pi Sigma squared, where Sigma squared goes inside the square root, so Sigma squared is the variance and Sigma is the standard deviation. The second part of the PDF is this exponential. So we take X minus the mean, which we denote by mu, square that, divide it by 2 Sigma squared, or two times the variance, negate it, and then we exponentiate that. Note that since the term containing X is squared, this PDF is symmetric: if you go a distance from the mean to the left, or that same distance to the right, you will get the same value for the PDF. So let’s do an exercise where we plot the values of a Gaussian curve from, say, -100 to 100. We’ll create a new function and call it my Gaussian.
It will take in two parameters, mu and sigma squared, along with min_x, max_x, and n, and it will output an array. So n will be the number of different values between min_x and max_x. We’re going to start our little x value at min_x. And then we want to know how much to increment x on each iteration of the loop, so we’ll call that dx, and we’ll say it’s max_x minus min_x divided by n. At the end of each loop iteration we’re going to add dx to x. We’re going to call the output f. Okay, so, we’re going to return the array of x values, also, so we return X and f. So, X(i) is going to equal x, and f(i) is going to equal 1 over the square root of 2 pi sigma_sq, times the exponential of minus the square of x minus the mean, divided by 2 times sigma squared. So let’s do this for mu equals 0, sigma squared equals 1, let’s say -10 to 10, and then have a thousand values between them. Okay, so, now we can plot X and f, alright, so we get this bell curve. The peak is at zero because that’s the mean, and then it’s spread out from about -2 to 2, so the drop-off, or how fast f of x goes to 0, is pretty quick. You can see the maximum value is about 0.4. Let’s try that again with a smaller variance, and some smaller values also for min_x and max_x. So let’s do 0.1 for sigma squared, and let’s plot it again. Alright, so, the drop-off is even quicker now, where we get to about 0 at -1 and 1, and notice the peak value is above 1.2. So PDF values above 1 are allowed.
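The loop described above might look something like this. It is written inline as a script rather than as the lecture’s function, and the variable names are assumptions based on the spoken description.

```matlab
% Gaussian PDF evaluated point by point, following the loop described above.
mu = 0;  sigma_sq = 1;           % mean and variance
min_x = -10;  max_x = 10;        % plotting range
n = 1000;                        % number of points

x = zeros(1, n);                 % x values
f = zeros(1, n);                 % PDF values
dx = (max_x - min_x) / n;        % step between consecutive x values

xi = min_x;
for i = 1:n
    x(i) = xi;
    f(i) = 1 / sqrt(2 * pi * sigma_sq) * exp(-(xi - mu)^2 / (2 * sigma_sq));
    xi = xi + dx;                % advance to the next point
end

plot(x, f);

peak = max(f);                   % height of the bell curve at its center
area = sum(f) * dx;              % Riemann-sum estimate of the area under the curve
fprintf('peak = %.4f, area = %.4f\n', peak, area);
```

Changing sigma_sq to 0.1 in this sketch should push the peak above 1 while the estimated area stays close to 1, which is the point about densities being allowed to exceed 1.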
In this lecture we’re going to talk about mean and variance. So how do we measure the distribution of continuous variables? We know that with discrete variables we can just count how often the random variable takes on a certain value and divide that by the total number of values to get an approximation, or an estimate, of that value’s probability, but we can’t exactly do that with continuous variables. We can’t really count them and put them in buckets like discrete variables. So we need to measure certain characteristics of the distribution that might tell us the shape of its distribution curve. Two common measures that you probably have already heard of are the mean and the variance, so first let’s talk about the mean. The mean is like the average. You add up all the different values and you divide by the total number of values, and that kind of tells you what the middle value would be. So what I want you to do is to load back up our random integer CSV. You know how to calculate the sum of all values, and we want to divide this by the total number of values. Let’s just plot r again to remind ourselves what values it can take, so -5 to 5, and we get an average value of 0.35. So why is this value so small when we generated values from -5 to 5? With a uniform distribution the values are balanced between less than zero and greater than zero, so a lot of the numbers end up canceling each other out when you sum them, and therefore our mean value is about zero. Notice that we can never actually get the value 0.35 from our distribution, because our distribution gave us uniformly distributed numbers between -5 and 5, but only with discrete values. So you could get negative 5, negative 4, negative 3, negative 2, negative 1, or 1 all the way up to 5, but you would never get the value 0.35.
So it might seem odd that we also call the mean value the expected value even though we would never expect to get that actual value. We only expect it to be the average of the values that we do draw from this random distribution. Of course there is an easier way to calculate the mean in MATLAB, and it’s just the function mean. Right, so, we get the same value. Now let’s talk about the second measure of a distribution, the variance. We already talked about the mean, which is kind of like the middle value, so we could say that it measures the centeredness of the distribution, so where the middle is, where the center is. Variance does a different thing: it measures the spread. So the mean tells us where the random variable is centered, and the variance tells us how much it’s spread out from that middle value. The definition of the variance is the expected value of the squared difference between the random variable and its mean. Of course that doesn’t help us much since it doesn’t tell us how to calculate or estimate the variance, but it’s very similar to how we do the mean. So, let’s set the mean of R to be mean(R), like that. We can then calculate the variance of R by taking R minus the mean of R and doing the dot product of that with itself, which is effectively squaring all the individual differences and then summing them. So it’s treating R as a vector, and then finding the dot product with itself after subtracting the mean. In other words, it’s like subtracting the mean and then taking the squared distance, or the squared length, and of course we divide by the length of R. So we get 10.6475. Of course MATLAB has its own variance function, so we didn’t really need to do all this work, and it’s just var. So we get a pretty similar value.
One thing that comes up in statistics, which is a little bit outside the scope of our discussion, is that if instead of dividing by n we divide by n minus 1, that gives us what we call an unbiased estimate of the variance, and so this gives us exactly the value that MATLAB’s var returns for R.
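The mean and variance calculations from this lecture can be sketched as follows. The lecture loads its integers from a CSV file; here R is an assumed stand-in generated with randi, so the numbers will differ from the 0.35 and 10.6475 quoted above.

```matlab
% Assumed stand-in for the lecture's CSV data: 1000 integers from -5 to 5.
R = randi([-5 5], 1000, 1);

% Mean by hand: sum of the values divided by how many there are.
m = sum(R) / length(R);

% Variance by hand: dot product of the centered vector with itself.
d = R - m;
v_biased = (d' * d) / length(R);          % divide by n
v_unbiased = (d' * d) / (length(R) - 1);  % divide by n - 1

fprintf('mean: %.4f (by hand) vs %.4f (mean)\n', m, mean(R));
fprintf('variance: %.4f (n), %.4f (n-1), %.4f (var)\n', v_biased, v_unbiased, var(R));
```

If you want MATLAB to divide by n instead, var(R, 1) does that.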
In this lecture we’re going to talk about continuous variables, since we’ve only talked about discrete variables up until now. Discrete variables can only take on distinct values, but continuous variables can take on any value. So with continuous variables we don’t have a notion of probabilities for exact values because X can take on an infinite number of values, so the probability of equaling any specific exact value is zero. We can have probabilities for ranges though. So, for example, we can say the probability of X being between 3.13 and 3.15 is greater than zero. We have a useful function called the cumulative distribution function, or the CDF, that helps us measure such probabilities. We usually label this function as big F of x, and the definition of F(x) is that it’s the probability that the random variable big X is greater than negative infinity, but less than or equal to little x. Note that the probability of big X being between negative infinity and positive infinity is 1 since X has to take on a value, therefore the value of big F of positive infinity is equal to 1. Now let’s go back to our original problem, where we want to calculate the probability that X is between 3.13 and 3.15. That would just be big F of 3.15 minus big F of 3.13. So now let’s talk about the other useful function when we’re talking about continuous variables. This one’s called the probability density function, or the PDF. We usually denote it by little f of x, and it is defined as the derivative of big F of x with respect to x, so it’s like the slope of big F of x. Note that this function can be greater than one since it’s not a probability, it is a probability density. Little f of x does have to be greater than or equal to 0 though. So here’s one example where little f of x can be bigger than one.
So, let’s say little f of X is uniform between zero and 0.1, so that means if you try to sample from this random variable X you’ll always get a value between zero and 0.1, and any value in that range is as likely as any other. Now I’m going to claim that little f of x has to equal 10 if x is between 0 and 0.1, and 0 otherwise. Now why is this? Since little f of X is the derivative of big F of X, big F of X is the integral of little f of X. Call the constant value of little f on that range c. In the integral we can take the constant c out and then calculate the integral from 0 to x. Now we know from above that big F of infinity has to equal 1, so the integral from minus infinity to infinity equals 1, but since little f of x is 0 below 0 and above 0.1, we can just take the integral from 0 to 0.1. That gives us 0.1c, and if we solve 0.1c equals 1 for c, c equals 10. Therefore, we’ve seen a scenario where little f of X can have a value greater than one because it’s a probability density, and not a probability value. Later on in this course we’ll look at more complex continuous distributions.
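As a quick numerical sanity check of the argument above, here is a minimal sketch that integrates the density with c equal to 10. The function handle names and the example range 0.03 to 0.05 are assumptions for illustration.

```matlab
c = 10;                                        % the constant derived above

% Density: c on [0, 0.1] and 0 everywhere else.
f = @(x) c * double(x >= 0 & x <= 0.1);

% Total probability: the density is zero outside [0, 0.1], so integrating
% over that interval is the same as integrating over the whole real line.
total = integral(f, 0, 0.1);

% CDF at a point inside [0, 0.1]: integrate the density from 0 up to x.
F = @(x) integral(f, 0, x);
half = F(0.05);

% Probability of a range, computed as a difference of CDF values.
pRange = F(0.05) - F(0.03);

fprintf('total = %.4f, F(0.05) = %.4f, P(0.03 < X < 0.05) = %.4f\n', total, half, pRange);
```

The density value is 10 everywhere on the interval, yet every probability computed from it stays between 0 and 1.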
One interesting and popular problem in probability is called the birthday problem, or the birthday paradox, and the problem goes something like this. Given a classroom of n students, what is the probability that at least one pair of students shares a birthday? Now you might be surprised that at n equals 23 the probability is about fifty percent, which is why this problem is called a paradox. So that means in an average sized classroom there’s a pretty good chance that two people in the class share a birthday. This is counterintuitive since there are 365 days in a year. In this lecture I’ll show you the theory behind the solution, and how to visualize it in MATLAB. So the first thing is that working directly with at least one pair of people sharing a birthday is difficult, but remember that the probabilities of disjoint events that together cover every outcome have to add up to one. So what is the opposite of at least one pair of people sharing a birthday? It’s not two people sharing a birthday, or three people sharing a birthday, or two pairs of two people sharing a birthday; these events all fit into at least one pair sharing a birthday. So the two events that we want to talk about are that at least one pair shares a birthday, and that nobody shares a birthday. These two events are disjoint and between them cover every possibility, and therefore their probabilities have to add up to one. So we can calculate the probability that at least one pair of people shares a birthday as 1 minus the probability that nobody shares a birthday, and in mathematics we would call this a counting problem. So now let’s think about how we calculate the probability that nobody shares a birthday. There are 365 days in a year. Now if you think of each day as a bucket, we have one person and they have 365 buckets to choose from. The probability that this one person will collide with another is 0.
The probability that one person shares a birthday with somebody else when there’s only that one person is 0, so the probability that nobody shares a birthday in this case is 1, or 365/365. Now what about two people? With one person already having chosen a birthday, or a bucket, the second person has a 1/365 chance of colliding with that person. So the probability that nobody shares a birthday is 364/365. This is the case with two people. Now if we have three people, the third person only has 363 free buckets to choose from. So we have 364/365 times 363/365, and we can multiply these probabilities because each factor is the chance of avoiding a collision given that no collision has happened so far. So this is the probability that in a group of three people nobody shares a birthday, and 1 minus this is the probability that at least one pair shares a birthday. We can continue this pattern, but it would probably be easier to write a MATLAB function to do this. So my birthday function is going to return all the probabilities up to the value n. I’m going to initialize a to be an array of zeros, and I’m going to count up to n and fill in the values of a. Actually, I’m going to say ones because we’re subtracting from one. Actually, it doesn’t matter what I initialize a to because I’m going to say 1 minus over here. Okay, so, we know that we have to multiply by something over 365 each time and subtract that from one, and so we can use a for loop to iterate over the thing that has to be subtracted and multiply in the new value iteratively. Now that we have our function let’s test it, so let’s say a equals birthday, and let’s set n equal to 100, and let’s plot. Okay, so, you can see here when n is about 23 you get a probability around 0.5. When n is equal to 50 you’re well above 90%, so there’s a pretty good chance in a group of 50 people that at least one pair of people shares a birthday. So let’s check a(23), right, it’s about 50%, and that is the solution to the birthday paradox.
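The loop described above can be sketched like this. It is written inline as a script rather than as the lecture’s birthday function, and the running-product variable p_none is an assumed name.

```matlab
% Probability that at least one pair shares a birthday, for group sizes 1..n.
n = 100;
a = zeros(1, n);       % a(k) = P(at least one shared birthday among k people)

p_none = 1;            % running probability that nobody shares a birthday
for k = 1:n
    % The k-th person must avoid the k-1 birthdays already taken.
    p_none = p_none * (365 - (k - 1)) / 365;
    a(k) = 1 - p_none;
end

plot(1:n, a);
xlabel('group size n'); ylabel('P(at least one shared birthday)');

fprintf('n = 23: %.4f\n', a(23));
fprintf('n = 50: %.4f\n', a(50));
```

Keeping the running product in its own variable and subtracting from 1 at each step mirrors the "1 minus the probability that nobody shares a birthday" argument directly.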
So in this class we’re going to talk about generating a random variable from a certain distribution. This could be useful for doing simulations of systems that have uncertainty. MATLAB has some built-in functions to help us do this. So the first one we’re going to talk about is called randi, and it takes one argument called imax, and this function gives us a uniformly distributed integer between one and imax. So let’s try it. So 9 is in between 1 and 10. Now there’s another form of randi which takes in a maximum value and another parameter called n. So let’s set n to 3, and that returns an n-by-n matrix of random values between 1 and imax. So suppose I wanted to generate random values between 10 and 20, how would we do that? Because randi can give us values anywhere between 1 and imax, what we could do is just add 10 to all the values that randi returns. So this gives us a 3 by 3 matrix with values only between 10 and 20. Another useful function is just rand by itself. This function gives us a random number between 0 and 1, so it’s different from the previous one in that we don’t get integers, we get real numbers. rand returns a number with a uniform distribution, so the probability of getting a value near 0.25 is the same as the probability of getting a value near 0.75. So let’s try and plot histograms for different values of n. Alright, so this is a histogram for random numbers between 0 and 1, with an array of n equals 10. Alright, so it’s not quite uniformly distributed, let’s try a bigger n value. Alright, so immediately it starts looking more uniformly distributed as n increases, so let’s try a bigger n. Alright, so it looks even more uniformly distributed. Now, 10,000. Alright, so it’s almost flat. That’s 100,000, and this is a million, and it looks almost perfectly flat. So that’s the idea with the frequentist view of probability: as n approaches infinity, your estimated probabilities approach their true values. So now let’s think about a different problem.
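Before moving on, here is a short sketch recapping the generators covered so far. The specific arguments are assumptions chosen to match the spoken description. Note that adding 10 to randi(10, 3) gives integers from 11 to 20; the range form randi([10 20], 3) is one way to include 10 as well.

```matlab
r1 = randi(10);                  % one random integer from 1 to 10
M = randi(10, 3);                % 3-by-3 matrix of integers from 1 to 10

M_shift = randi(10, 3) + 10;     % shifted up by 10: integers from 11 to 20
M_range = randi([10 20], 3);     % range form: integers from 10 to 20 inclusive

u_small = rand(1, 10);           % 10 uniform real numbers between 0 and 1
u_large = rand(1, 1000000);      % a million of them

% With few samples the histogram looks uneven; with many it flattens out.
subplot(1, 2, 1); histogram(u_small, 10); title('n = 10');
subplot(1, 2, 2); histogram(u_large, 10); title('n = 1,000,000');
```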
Suppose I want a specific discrete distribution, so say I want to simulate an unfair coin. So, to write it out, I want p of heads to equal 0.25, and I want p of tails to equal 0.75. How could I write a function to give me random values drawn from this distribution instead of a uniform distribution? We can create a function to do this. We can call it biased coin. It’s going to take in one value, little p, or let’s call it P heads, which is the probability of getting heads, and it’s going to return the coin face. So we’re going to generate a random value; if it’s less than P heads we’re going to return heads, else we’re going to return tails. Let’s try our function. Alright, so now we’re going to try our new biased coin function by initializing an array of, say, size 1,000… you know what, we’re going to do this in a separate function. We’re going to initialize an n by 1 array, we’re going to count from 1 to n, and we’re going to use the biased coin function to generate a value for each element of the array. Alright, so let’s try the function we just made. Test coin with 0.25 for n equal to 1,000. Alright, so you see heads, which resolves to the integer 104, the character code of 'h', and tails, which resolves to the integer 116, the character code of 't'. So you see the count for heads is about 250 and the count for tails is about 750, which is what we would expect in a thousand coin tosses.
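The biased coin simulation can be sketched roughly as below. It is written inline as a script rather than as the two functions from the lecture, and it assumes the faces are stored as the characters 'h' and 't', which is consistent with the integers 104 and 116 mentioned above.

```matlab
p_heads = 0.25;                 % probability of heads
n = 1000;                       % number of tosses

faces = repmat('t', n, 1);      % n-by-1 char array, tails unless overwritten
for i = 1:n
    % rand is uniform on (0,1), so it falls below p_heads with probability p_heads.
    if rand < p_heads
        faces(i) = 'h';
    end
end

n_heads = sum(faces == 'h');
n_tails = sum(faces == 't');
fprintf('heads: %d, tails: %d\n', n_heads, n_tails);

% Converting the characters to numbers gives their character codes.
disp(double('h'));
disp(double('t'));
```

The same trick of comparing rand against a threshold extends to more than two outcomes by comparing against cumulative probabilities.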