
1. **Multivariate Derivatives** :silhouettes:
1.1. **Convexity**
1.1.1. **Derivative Condition** Hf is called positive semi-definite if v^T Hf v >= 0 for every vector v, and this implies f is convex
1.1.2. **A Warning** When moving to more complex models, e.g. neural networks, the loss surface has many local minima & many saddle points, so it is not convex
1.1.3. The opposite notion is **concave**: f is concave if -f is convex
1.1.4. A function is convex if the line segment between any two points on its graph stays above the graph
1.1.5. **Benefits** When a function is convex then there is * a single **unique** local minimum * no maxima * no saddle points * **Gradient descent** is guaranteed to find the global minimum with a small enough learning rate * Newton's Method always works (a quick numeric check of the convexity condition is sketched below)
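One quick way to sanity-check the positive semi-definite condition above is to look at the eigenvalues of Hf. A minimal numpy sketch, using the illustrative function f(x, y) = x² + 3y², whose Hessian is diag(2, 6) (the function is an assumed example, not from the notes):

```python
import numpy as np

# Hessian of the illustrative convex function f(x, y) = x**2 + 3*y**2;
# for this f the Hessian is constant.
H = np.array([[2.0, 0.0],
              [0.0, 6.0]])

# Positive semi-definite <=> all eigenvalues of the symmetric matrix are >= 0.
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues, np.all(eigenvalues >= 0))  # [2. 6.] True -> f is convex
```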
1.2. **The Gradient**
1.2.1. **Definitions**
1.2.1.1. **Matrix derivative** for a function f of a matrix A, df/dA is the matrix of partial derivatives with entries ∂f/∂A_ij
1.2.1.2. **Vector derivative** for a function f of a vector x, df/dx is the vector of partial derivatives with entries ∂f/∂x_i
1.2.1.3. **The Gradient** the collection of partial derivatives ∇f(x) = (∂f/∂x_1, ..., ∂f/∂x_n)
1.2.2. **Gradient Descent** iterate x_{n+1} = x_n - η∇f(x_n) with learning rate η (a numeric sketch follows this subsection)
1.2.3. **Level set** the set of points where the function takes a fixed value, {x : f(x) = c}
1.2.4. **Visualize** 
1.2.5. **Key Properties**
1.2.5.1. ∇f points in the direction of **maximum increase**
1.2.5.2. -∇f points in the direction of **maximum decrease**
1.2.5.3. ∇f = 0 at local max & min
1.3. **Second Derivative**
1.3.1. **Hessian**
1.3.1.1. **2D intuition**
1.3.1.1.1. Hf = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]]
1.3.1.1.2. **Critical points** ∇f = 0
1.3.1.1.3. At a critical point, Hf with all positive eigenvalues curves up in every direction -> local minimum; all negative eigenvalues -> local maximum
1.3.1.1.4. Hf with a mix of positive and negative eigenvalues -> saddle point
1.3.1.3. If the matrix is diagonal, a positive entry is a direction where it curves up, and a negative entry is a direction where it curves down
1.3.2. **Trace** sum of diagonal terms tr(Hf)
1.3.3. For f : R^n -> R, there are n² many second derivatives (one for each pair of input variables)
1.4. If you have a function f(x) where x is an n-dimensional vector, then it is a function of many variables. You need to know how the function responds to changes in all of them. The majority of this will be just bookkeeping, but it will be terribly messy bookkeeping.
1.5. **Partial Derivatives** a partial derivative measures the rate of change of the function when one of the variables is subjected to a small change while the others are kept constant. **Example**:
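An illustrative example (the specific function is an assumption, chosen as a simple two-variable polynomial):

```latex
f(x, y) = x^2 y, \qquad
\frac{\partial f}{\partial x} = 2xy \ \ (\text{treat } y \text{ as constant}), \qquad
\frac{\partial f}{\partial y} = x^2 \ \ (\text{treat } x \text{ as constant})
```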
1.6. **Newton Method**
1.6.1. x_{n+1} = x_n - Hf(x_n)^(-1) ∇f(x_n)
1.6.2. The optimal computational complexity of inverting an *n x n* matrix is not actually known, but the best-known algorithm runs in roughly *O(n^2.373)*. For high-dimensional data sets, anything past linear time in the dimension is often impractical, so Newton's Method is reserved for a few hundred dimensions at most.
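Because of that cost, implementations typically solve the linear system Hf·d = ∇f rather than forming the inverse explicitly. A minimal sketch of one multivariate Newton step on an illustrative quadratic (the matrix A, vector b, and starting point are assumptions):

```python
import numpy as np

# Illustrative function f(x) = 0.5 * x^T A x - b^T x, with known gradient and Hessian.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def gradient(x):
    return A @ x - b

def hessian(x):
    return A  # constant for a quadratic

x = np.array([10.0, 10.0])
# One Newton step: solve Hf * d = grad instead of inverting Hf.
d = np.linalg.solve(hessian(x), gradient(x))
x = x - d
print(x, gradient(x))  # for a quadratic, one step lands on the minimizer (gradient ~ 0)
```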
1.7. **Matrix Calculus** 
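A few standard matrix-calculus identities of the kind this branch usually collects; treat the specific selection as a reconstruction rather than the original sub-nodes:

```latex
\nabla_x (a^{\top} x) = a, \qquad
\nabla_x (x^{\top} A x) = (A + A^{\top}) x, \qquad
\nabla_x \lVert x \rVert^2 = 2x, \qquad
\nabla_A \operatorname{tr}(AB) = B^{\top}
```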
2. **Vectors** :arrow_right:
3. **Univariate Derivatives** :silhouette:
3.1. **Newton's method**
3.1.1. **Idea**
3.1.1.1. Minimizing f <---> f '(x)=0
3.1.1.1.1. Now look for an algorithm to find the zero of some function g(x)
3.1.1.1.2. Apply this algorithm to f '(x)
3.1.2. **Computing the Line**
3.1.2.1. line through (x_0, g(x_0)) with slope g '(x_0): y = g '(x_0)(x - x_0) + g(x_0); solve the equation y = 0
3.1.2.2. setting y = 0 gives x = x_0 - g(x_0)/g '(x_0)
3.1.3. **Relationship to Gradient Descent** Newton's Method looks like gradient descent with a learning rate that adapts to f, namely η = 1/f ''(x_n)
3.1.4. **Update Step for Zero Finding**
3.1.4.1. we want to find where **g(x)=0** and we start with some initial guess x0 and then iterate
3.1.4.2. x_{n+1} = x_n - g(x_n)/g '(x_n)
3.1.5. **Pictorially** (plot of g(x), with tangent-line steps approaching the x such that g(x)=0)
3.1.6. **Update Step for Minimization**
3.1.6.1. To minimize **f**, we want to find where **f '(x)=0** and thus we may start with some initial guess x0 and then iterate Newton's Method on **f '** to get
3.1.6.2. x_{n+1} = x_n - f '(x_n)/f ''(x_n)
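A minimal sketch of this update in code; the function f(x) = x⁴ - 3x² + 2, its hand-written derivatives, and the starting point are illustrative choices:

```python
# Newton's Method on f' to minimize f(x) = x**4 - 3*x**2 + 2 (illustrative choice).
def f_prime(x):
    return 4 * x**3 - 6 * x

def f_double_prime(x):
    return 12 * x**2 - 6

x = 2.0  # initial guess x0
for _ in range(10):
    x = x - f_prime(x) / f_double_prime(x)  # x_{n+1} = x_n - f'(x_n)/f''(x_n)
print(x)  # converges to sqrt(1.5) ~ 1.2247, a local minimum of f
```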
3.2. **Gradient Descent**
3.2.1. As simplistic as this is, **almost all** machine learning you have heard of uses **some version** of this in the **learning process**
3.2.2. **Goal**: Minimize f(x)
3.2.3. **Issues**:
3.2.3.1. * how to pick eta
3.2.3.2. * recall that an **improperly chosen** learning rate will cause the entire optimization procedure to either **fail** or **operate too slowly** to be of practical use.  
3.2.3.3. * Sometimes we can **circumvent** this issue.
3.2.4. **ALGORITHM**
3.2.4.1. **1**. Start with a guess x_0
3.2.4.2. **2.** Iterate through x_{n+1} = x_n - η f '(x_n), where η is the learning rate
3.2.4.3. **3.** Stop after some condition is met * the value of x doesn't change by more than 0.001 * a fixed number of steps * fancier things TBD (a minimal implementation is sketched below)
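The minimal implementation referenced above; the example function (f(x) = (x - 3)², passed in via its derivative), learning rate, tolerance, and step cap are illustrative choices:

```python
def gradient_descent(f_prime, x0, eta=0.1, tol=1e-3, max_steps=10_000):
    """Minimize a univariate function given its derivative f_prime."""
    x = x0                                 # 1. start with a guess x0
    for _ in range(max_steps):             # 3b. a fixed number of steps
        x_new = x - eta * f_prime(x)       # 2. x_{n+1} = x_n - eta * f'(x_n)
        if abs(x_new - x) < tol:           # 3a. stop if x barely changes
            return x_new
        x = x_new
    return x

# Example: minimize f(x) = (x - 3)^2, whose derivative is 2*(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```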
3.3. **Maximum Likelihood Estimation**
3.3.1. Given observed data **D** and a model with parameter **p**, the likelihood **Pp(D)** is the probability of observing **D** when the parameter is **p**
3.3.2. find **p** such that **Pp(D)** is maximized
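As a concrete sketch (assuming, as the notation p suggests, a coin-flip model where the data D contains h heads and t tails):

```latex
P_p(D) = p^{h} (1 - p)^{t}, \qquad
\frac{d}{dp} \log P_p(D) = \frac{h}{p} - \frac{t}{1 - p} = 0
\;\Longrightarrow\;
\hat{p} = \frac{h}{h + t}
```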
3.4. **Second Derivative**
3.4.1. **f''(x)** shows how the slope is changing
3.4.1.1. At a critical point: **max** -> f '' < 0, **min** -> f '' > 0, **can't tell** -> f '' = 0, proceed with higher derivatives
3.5. **Derivative**
3.5.1. Can be presented as: f '(x) = lim_{є -> 0} (f(x + є) - f(x)) / є
3.5.2. **Interpretation**
3.5.2.1. the slope of the tangent line to f at x
3.5.2.2. the instantaneous rate of change of f at x
3.5.2.3. the best linear approximation to f near x
3.5.3. let's approximate f(x + є) ≈ f(x)
3.5.4. better approximation **f(x + є) ≈ f(x) + f '(x)є**
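A quick numeric check of this first-order approximation, using the illustrative function f(x) = x³ at x = 2 with є = 0.01:

```python
# Compare f(x + eps) with the first-order approximation f(x) + f'(x)*eps.
def f(x):
    return x**3

def f_prime(x):
    return 3 * x**2

x, eps = 2.0, 0.01
exact = f(x + eps)                    # 8.120601
approx = f(x) + f_prime(x) * eps      # 8.12
print(exact, approx, exact - approx)  # error ~ 6e-4, and it shrinks as eps shrinks
```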
3.5.5. **Rules**
3.5.5.1. **Chain Rule** (f(g(x)))' = f '(g(x)) · g '(x)
3.5.5.1.1. **Alternative** dy/dx = (dy/du) · (du/dx)
3.5.5.2. **Product Rule** (f·g)' = f '·g + f·g '
3.5.5.3. **Sum Rule** (f + g)' = f ' + g '
3.5.5.4. **Quotient Rule** (f/g)' = (f '·g - f·g ')/g²
3.5.6. **Most usable**
3.5.6.1. http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html
3.5.6.2. d/dx x^n = n·x^(n-1)
3.5.6.3. d/dx e^x = e^x
3.5.6.4. d/dx ln(x) = 1/x
4. **Matrices** :bookmark_tabs:
4.1. **A motivating example**
4.2. **Matrix multiplication and examples** 
4.3. **Dot product and how to extract angles**
4.4. **Matrix product properties**
4.4.1. **Matrix Products**
4.4.1.1. **Distributivity** A(B+C) = AB + AC
4.4.1.2. **Associativity** A(BC)=(AB)C
4.4.1.3. **Non-commutativity** in general AB != BA
4.4.2. **The Identity Matrix** IA = A; I has ones on the diagonal and zeros everywhere else
4.4.3. **Properties of the Hadamard Product**
4.4.3.1. **Distributivity** Ao(B+C) = AoB + AoC
4.4.3.2. **Associativity** Ao(BoC) = (AoB)oC
4.4.3.3. **Commutativity** AoB = BoA
4.5. **Hadamard product**
4.5.1. An (often less useful) method of multiplying matrices is element-wise: (AoB)_ij = A_ij · B_ij
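A small numpy illustration contrasting the two products (the matrices are arbitrary examples): the ordinary product is not commutative, while the Hadamard product is.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(A @ B)                         # matrix product
print(B @ A)                         # different result: AB != BA in general
print(A * B)                         # Hadamard (element-wise) product
print(np.array_equal(A * B, B * A))  # True: the Hadamard product is commutative
```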
4.6. **Determinant computation**
4.6.1. **The Two-by-two** for A = [[a, b], [c, d]], det(A) = ad - bc
4.6.2. **Larger Matrices** expanding along a row gives m determinants of (m-1)x(m-1) matrices; a computer does it more simply, in about O(m^3) time, using what are called matrix factorizations
4.7. **Linear dependence**
4.7.1. det(A) = 0 if and only if the columns of A are linearly dependent
4.7.2. **Definition** vectors v1, ..., vk are linearly dependent (they lie in a lower-dimensional space) if there are coefficients a1, ..., ak, not all zero, with a1*v1 + a2*v2 + ... + ak*vk = 0
4.7.3. **Example**    a1=1, a2=-2, a3=-1
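An illustrative choice of vectors consistent with the coefficients a1=1, a2=-2, a3=-1 (the vectors themselves are an assumption), checked with numpy:

```python
import numpy as np

# Illustrative dependent vectors: v1 = 2*v2 + v3, so 1*v1 - 2*v2 - 1*v3 = 0.
v2 = np.array([1.0, 0.0, 2.0])
v3 = np.array([0.0, 1.0, 1.0])
v1 = 2 * v2 + v3

A = np.column_stack([v1, v2, v3])  # columns are linearly dependent
print(np.linalg.det(A))            # ~0 (up to floating-point error)
print(1 * v1 - 2 * v2 - 1 * v3)    # the zero vector, matching a1=1, a2=-2, a3=-1
```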
4.8. **Geometry of matrix operations**
4.8.1. **Intuition from Two Dimensions**
4.8.1.1. Suppose A is a 2x2 matrix (mapping R^2 to itself). Any such matrix can be expressed uniquely as a **stretching**, followed by a **skewing**, followed by a **rotation**
4.8.1.2. Any vector can be written as a sum of scalar multiples of two specific vectors, so A applied to any vector is determined by what A does to those two vectors
4.8.2. **The Determinant** det(A) is the factor by which area is multiplied; det(A) is negative if the map flips the plane over
4.9. **Matrix invertibility**
4.9.1. **When can you invert?** only when det(A) != 0
4.9.2. **How to Compute the Inverse** A^(-1)*A=I 
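A short numpy check of both facts on an arbitrary invertible matrix (the matrix is an illustrative choice):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

print(np.linalg.det(A))  # 5.0 != 0, so A is invertible
A_inv = np.linalg.inv(A)
print(A_inv @ A)         # ~ identity matrix I (up to floating-point error)
```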
5. **Probability** :game_die:
5.1. **Axioms of probability**
5.1.1. 1. The fraction of the times an event occurs is between 0 and 1.
5.1.2. 2. Something always happens.
5.1.3. 3. If two events can't happen at the same time (disjoint events), then the fraction of the time that _at least one_ of them occurs is the sum of the fractions of the time each one occurs separately.
5.2. **Terminology**
5.2.1. **Outcome** A single possibility from the experiment
5.2.2. **Sample Space** The set of all possible outcomes _Capital Omega_
5.2.3. **Event** Something you can observe with a yes/no answer _Capital E_
5.2.4. **Probability** Fraction of repeated experiments in which an event occurs _P{E} є [0,1]_
5.3. **Visualizing Probability** using Venn diagram
5.3.1. **Inclusion/Exclusion** P{A∪B} = P{A} + P{B} - P{A∩B}
5.3.1.1. **Intersection** of two sets: A∩B
5.3.1.2. **Union** of two sets: A∪B
5.3.1.3. **Symmetric difference** of two sets: AΔB = (A∪B) \ (A∩B)
5.3.1.4. **Relative complement** of A in B: B \ A
5.3.1.5. **Absolute complement** of A in U: Aᶜ = U \ A
5.3.2. **General Picture** Sample Space <-> Region; Outcomes <-> Points; Events <-> Subregions; Disjoint events <-> Disjoint subregions; Probability <-> Area of subregion
5.4. **Conditional probability**
5.4.1. If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A
5.4.2. P{A|B} = P{A∩B} / P{B}
5.4.3. equivalently, P{A∩B} = P{A|B} · P{B}
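A small worked example of the formula, using an illustrative fair six-sided die: let B = "the roll is even" and A = "the roll is at least 4".

```latex
P\{B\} = \tfrac{3}{6}, \qquad
P\{A \cap B\} = P\{\text{roll} \in \{4, 6\}\} = \tfrac{2}{6}, \qquad
P\{A \mid B\} = \frac{P\{A \cap B\}}{P\{B\}} = \frac{2/6}{3/6} = \tfrac{2}{3}
```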
5.5. **Intuition:** The probability of an event is the expected fraction of time that the outcome would occur with repeated experiments.
5.6. **Building machine learning models**
5.6.1. Maximum Likelihood Estimation
5.6.1.1. Given a probability model with some vector of parameters (Theta) and observed data **D**, the best fitting model is the one that maximizes the probability P_Theta(D)
5.7. **Bayes’ rule**
5.7.1. can be leveraged to understand **competing hypotheses**
5.7.2. odds are a ratio of two probabilities, e.g. 2:1
5.7.3. Posterior odds = ratio of the probabilities of generating the data under each hypothesis (the likelihood ratio) × prior odds
5.7.4. P{A|B} = P{B|A}·P{A} / P{B}
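A small numeric illustration of the odds form; the hypotheses and numbers are illustrative assumptions: prior odds for H1 against H2 of 1:4, and data D generated with probability 0.9 under H1 and 0.3 under H2.

```latex
\frac{P\{H_1 \mid D\}}{P\{H_2 \mid D\}}
= \frac{P\{D \mid H_1\}}{P\{D \mid H_2\}} \times \frac{P\{H_1\}}{P\{H_2\}}
= \frac{0.9}{0.3} \times \frac{1}{4} = \frac{3}{4}
```

So observing D moves the odds from 1:4 to 3:4 in favour of H1.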
5.8. **Independence**
5.8.1. Two events are **independent** if one event doesn't influence the other
5.8.2. A and B are independent if P{AnB}=P{A}*P{B}
5.9. **Chebyshev’s inequality**
5.9.1. For _any_ random variable X (no assumptions): P{|X - E[X]| >= k·σ} <= 1/k²
5.9.2. e.g. at least 99% of the time, X is within 10 standard deviations of its mean
5.10. **The Gaussian curve**
5.10.1. **Key Properties**
5.10.1.1. Central limit theorem
5.10.1.1.1. a statistical theory stating that, given a sufficiently large sample from a population with finite variance, the distribution of the sample mean is approximately Gaussian, centered at the mean of the population.
5.10.1.2. Maximum entropy distribution
5.10.1.2.1. Amongst **all** continuous RVs with E[X]=0 and Var[X]=1, the entropy H(X) is maximized uniquely for X~N(0,1)
5.10.1.2.2. the Gaussian is the **most random** RV with fixed mean and variance
5.10.2. **General Gaussian Density**
5.10.2.1. p(x) = 1/(σ·sqrt(2π)) · e^(-(x-μ)²/(2σ²))
5.10.2.2. E[X] = μ, Var[X] = σ²
5.10.3. **Standard Gaussian (Normal Distribution) Density**
5.10.3.1. p(x) = 1/sqrt(2π) · e^(-x²/2)
5.10.3.2. E[X]=0 Var[X]=1
5.11. **Random variables**
5.11.1. is a function X that takes in an outcome and gives a number back
5.11.2. A discrete X takes at most countably many values, usually only a finite set of values
5.11.3. **Expected Value** (mean) E[X] = Σ x·P{X=x}
5.11.4. **Variance** how close to the mean the samples are: Var[X] = E[(X - E[X])²]
5.11.5. **Standard Deviation** σ = sqrt(Var[X])
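A short sketch computing these three quantities for a discrete random variable; the fair six-sided die is an illustrative choice:

```python
# X = value of a fair six-sided die: outcomes 1..6, each with probability 1/6.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mean = sum(x * p for x, p in zip(values, probs))                     # E[X] = 3.5
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))   # Var[X] ~ 2.9167
std_dev = variance ** 0.5                                            # sigma ~ 1.7078
print(mean, variance, std_dev)
```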
5.12. **Entropy**
5.12.1. **Entropy** (*H*) H(X) = -Σ p_i·log2(p_i), measured in bits
5.12.2. **The Only Choice was the Units** first you need to choose the base of the logarithm; using base b instead of 2 just divides the entropy in bits by log2(b), i.e. a change of units
5.12.3. Examples
5.12.3.1. **One coin** ***Entropy*** = one bit of randomness
5.12.3.1.1. H (1/2)
5.12.3.1.2. T (1/2)
5.12.3.2. **Two coins** ***Entropy*** = 2 bits of randomness
5.12.3.2.1. H
5.12.3.2.2. T
5.12.3.3. **A mixed case** ***Entropy*** = 1.5 bits of randomness = 1/2·(1 bit) + 1/2·(2 bits)
5.12.3.3.1. H (1/2)
5.12.3.3.2. T (1/2)
5.12.4. **Examine the Trees** if we flip n coins, then P = 1/2^n, so # coin flips = -log2(P)
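A minimal entropy function matching these examples (base-2 logarithm, so the result is in bits); the probability lists correspond to the coin examples above:

```python
from math import log2

def entropy(probs):
    """H = -sum(p * log2(p)) over outcomes with nonzero probability, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))         # one coin: 1.0 bit
print(entropy([0.25] * 4))         # two coins: 2.0 bits
print(entropy([0.5, 0.25, 0.25]))  # the mixed case: 1.5 bits
```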
5.13. **Continuous random variables**
5.13.1. For many applications ML works with **continuous random variables** (measurement with real numbers).
5.13.2. **Probability density function** a function p(x) >= 0 with P{a <= X <= b} = ∫ from a to b of p(x) dx
5.13.3. the total area under the density is 1: ∫ p(x) dx = 1