Calculus/Calculus on matrices


This section gives an overview of how calculus can be applied to matrices. Note that a general understanding of linear algebra is assumed: you should be familiar with the common ways of manipulating matrices.

The problem

Consider an $n \times n$ matrix $A$ and an $n \times 1$ vector $x$. How can we, for instance, find $\frac{d}{dx}Ax$? Now, if you were to naively apply single-variable calculus rules, a plausible answer would be

$$\frac{d}{dx}Ax = A$$

After all, the corresponding scalar form of the problem, $\frac{d}{dx}(ax)$, would indeed be $a$. And indeed the answer to the vector form is $A$. But now consider the following problem: $\frac{d}{dA}Ax$. If you were to reason from the scalar form, you'd probably think that the answer would be $x$. But that isn't right: the answer is actually $x^T$, where $T$ refers to the transpose of the vector $x$.

The purpose of this section is to scratch the surface of this beautiful field, which the average Calculus 3 or linear algebra course at university doesn't teach, yet which has quirks of its own. And what is it used for? Matrix calculus is widely used in machine learning and in other fields such as computational finance. It can also save us from working through (potentially nasty) expressions such as Lagrangians term by term, effectively reducing the problem to a single-variable scenario!

Derivative with respect to a vector

In this section, we consider problems that involve differentiating with respect to a vector $x$. As above, we assume that $x$ is a column vector.

One way to think about this problem is to reduce it to a problem of scalars. Notice that we can consider $x$ to be a collection of scalars: $x = (x_1, x_2, x_3, \ldots, x_n)^T$. Now take the individual partial derivatives with respect to each $x_i$ for $1 \le i \le n$. Finally, put them together. We're essentially finding $\nabla f(x)$ after all; the steps are the same (only that before, the size of $x$ was 2 or 3, representing the $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ coordinate frame).
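
To make this recipe concrete, here is a minimal sketch of a finite-difference Jacobian: differentiate with respect to one $x_i$ at a time, then stack the results column by column. This is an illustration added here, assuming NumPy; the helper name `jacobian` and the step size `eps` are arbitrary choices.

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of a vector-valued function f at x.

    Column i holds the partial derivative of f with respect to x_i,
    mirroring the "differentiate per scalar, then reassemble" recipe.
    """
    fx = np.asarray(f(x), dtype=float)
    J = np.empty((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x, dtype=float)
        dx[i] = eps
        J[:, i] = (np.asarray(f(x + dx), dtype=float) - fx) / eps
    return J

# Example: f(x, y) = (x*y, x + y) has Jacobian [[y, x], [1, 1]].
f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])
print(jacobian(f, np.array([2.0, 3.0])))  # approx [[3, 2], [1, 1]]
```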

So let's try this on the example above. Write $a_j$ for the $j$-th column of $A$, so that $Ax = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n$. We want to find, for all $1 \le i \le n$,

$$\frac{\partial}{\partial x_i}(Ax)$$

... which is $a_i$, the $i$-th column of $A$. And we get the corresponding column for every $i$.

Now what would you get if you combined all the partial derivatives? Just like how you'd assemble $\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)$, place the columns side by side to obtain $(a_1\ a_2\ \cdots\ a_n)$. This is just $A$! Indeed, that's why $\frac{d}{dx}Ax = A$.
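
We can verify this conclusion numerically. The following is a minimal sketch assuming NumPy; it approximates each partial derivative of $Ax$ with a forward difference and checks that the stacked columns reproduce $A$. The sizes and random values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

# Differentiate f(x) = Ax with respect to one x_i at a time and stack the
# results as columns, exactly as in the derivation above.
J = np.empty((n, n))
for i in range(n):
    dx = np.zeros(n)
    dx[i] = eps
    J[:, i] = (A @ (x + dx) - A @ x) / eps  # equals the i-th column of A

print(np.allclose(J, A, atol=1e-4))  # True: d/dx (Ax) = A
```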

A first step towards matrices

Now we return to the other problem: $\frac{d}{dA}Ax$.

Let's assume that $A$ is a 2 by 2 matrix, and represent it as $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$. Using the same notation for $x$, perform the matrix multiplication:

$$Ax = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix}$$

Take the partial derivatives with respect to each of the elements of $A$ (this is equivalent to finding the Jacobian). For $1 \le i, j \le 2$, find $\frac{\partial}{\partial a_{ij}}(Ax)$:

$$\frac{\partial (Ax)}{\partial a_{11}} = \begin{pmatrix} x_1 \\ 0 \end{pmatrix},\quad \frac{\partial (Ax)}{\partial a_{12}} = \begin{pmatrix} x_2 \\ 0 \end{pmatrix},\quad \frac{\partial (Ax)}{\partial a_{21}} = \begin{pmatrix} 0 \\ x_1 \end{pmatrix} \quad\text{and}\quad \frac{\partial (Ax)}{\partial a_{22}} = \begin{pmatrix} 0 \\ x_2 \end{pmatrix}$$
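
The same four partial derivatives can be recovered numerically by perturbing one entry of $A$ at a time. Here is a small sketch assuming NumPy; the concrete values of $A$ and $x$ are arbitrary choices for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])
eps = 1e-6

# Partial of Ax with respect to each entry a_ij: perturb one entry at a time.
for i in range(2):
    for j in range(2):
        dA = np.zeros((2, 2))
        dA[i, j] = eps
        partial = ((A + dA) @ x - A @ x) / eps
        print(f"d(Ax)/da{i+1}{j+1} =", np.round(partial, 6))
# Prints the column vectors (x1, 0), (x2, 0), (0, x1) and (0, x2),
# i.e. [5, 0], [6, 0], [0, 5], [0, 6].
```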

But then what do you do with that? How do you "combine" the result? Clearly, we are missing something.

Dimensions of ∇f

Let's take a step back and ask the question: given a vector-valued function $f = Ax$, what should the dimensions of $\nabla f$ be?

Consider the example above. We have two variables: $A$ with 4 elements (2 by 2) and $x$ with 2 elements, and we want to find $\frac{df}{dA}$. It is straightforward to show that $f$ is a $2 \times 1$ column vector, $f = \begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix}$. So we also need to consider the derivatives of $f_1$ and $f_2$ separately. In other words, each of $\frac{df_1}{dA}$ and $\frac{df_2}{dA}$ is a 2 by 2 matrix, corresponding to the partial derivatives with respect to each element of the matrix $A$.

How many elements would $\frac{df}{dA}$ have in total? For each of the two scalars that comprise $f$, there are four partial derivatives. This results in $2 \times (2 \times 2) = 8$ elements in total, so $\frac{df}{dA}$ is actually a tensor (which can be thought of as a higher-order matrix). This is where things start to get messy, but fortunately this is a simple enough example.
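
To see this shape explicitly, the sketch below (assuming NumPy, with the same arbitrary sample values as before) collects all eight partials into a $2 \times 2 \times 2$ array, indexed as [component of $f$, row of $A$, column of $A$].

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])
eps = 1e-6

# df/dA as a tensor: entry [k, i, j] is the partial of f_k w.r.t. a_ij.
T = np.empty((2, 2, 2))
for i in range(2):
    for j in range(2):
        dA = np.zeros((2, 2))
        dA[i, j] = eps
        T[:, i, j] = ((A + dA) @ x - A @ x) / eps

print(T.shape, T.size)  # (2, 2, 2) 8 -- two 2x2 "faces", one per component
```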

Getting a solution

So let's use the above observation to solve the problem.

First consider $\frac{df_1}{dA}$, where $f_1 = a_{11}x_1 + a_{12}x_2$. Compute the individual partial derivatives: $\frac{\partial f_1}{\partial a_{11}} = x_1$, $\frac{\partial f_1}{\partial a_{12}} = x_2$, $\frac{\partial f_1}{\partial a_{21}} = 0$ and $\frac{\partial f_1}{\partial a_{22}} = 0$. Similarly, $\frac{\partial f_2}{\partial a_{11}} = 0$, $\frac{\partial f_2}{\partial a_{12}} = 0$, $\frac{\partial f_2}{\partial a_{21}} = x_1$ and $\frac{\partial f_2}{\partial a_{22}} = x_2$.

Now, how do we combine these? The issue is that $\frac{df}{dA}$ is a tensor, but we can only display a 2D representation using matrices. So let's take the "face" that corresponds to $f_1$. We can represent the collection of partial derivatives (that is, the Jacobian) we found above in a matrix: $\nabla_A f_1 = \begin{pmatrix} x_1 & x_2 \\ 0 & 0 \end{pmatrix}$. Similarly, $\nabla_A f_2 = \begin{pmatrix} 0 & 0 \\ x_1 & x_2 \end{pmatrix}$. What do we observe? Each face carries $x^T$ in a single row (notice the change from column to row form)! And indeed, that's how we show that $\frac{d}{dA}Ax = x^T$.
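
Printing the two "faces" of that tensor makes the pattern visible. This continues the previous sketch (assuming NumPy, same arbitrary sample values); with $x = (5, 6)^T$, each face carries $x^T$ in a single row.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])
eps = 1e-6

# Rebuild the (2, 2, 2) tensor of partials, as in the previous sketch.
T = np.empty((2, 2, 2))
for i in range(2):
    for j in range(2):
        dA = np.zeros((2, 2))
        dA[i, j] = eps
        T[:, i, j] = ((A + dA) @ x - A @ x) / eps

print(np.round(T[0], 3))  # face for f1: [[5, 6], [0, 0]] -- x^T in row 1
print(np.round(T[1], 3))  # face for f2: [[0, 0], [5, 6]] -- x^T in row 2
```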

In practice

In practice, you won't have to do all this work every time you want to differentiate with respect to a matrix. Instead, there are many matrix calculus cookbooks (sometimes also called cheatsheets) that give tables of common derivatives with respect to vectors and matrices, and those are what you're likely to rely on in practice. One well-known example is The Matrix Cookbook.

An example

Consider the Markowitz problem. Assume that we have $n$ stocks, and we want to assign weights $w_i$ to them. The covariance between stocks $i$ and $j$ is $\sigma_{ij}$, for all $1 \le i, j \le n$. Suppose we want to solve this problem in the traditional way. Then the optimisation problem is to minimise $\sum_{i=1}^{n}\sum_{j=1}^{n} w_i \sigma_{ij} w_j$, subject to constraints that, while important, we won't mention here.

Solving this problem is likely to be messy given the double summation. Let's try matrix calculus instead. Let $w$ be an $n \times 1$ vector and $\Sigma$ be an $n \times n$ matrix. The above problem then reduces to minimising $w^T \Sigma w$, and all you need to do is take derivatives with respect to $w$! Using a matrix calculus cookbook, $\frac{d}{dw} w^T \Sigma w = 2 w^T \Sigma$ (since $\Sigma$, a covariance matrix, is symmetric), which is much more elegant than trying to compute the individual partial derivatives.
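
As a sanity check on the cookbook identity, the sketch below (assuming NumPy, with an arbitrary symmetric $\Sigma$ built as $BB^T$) compares a finite-difference gradient of $w^T \Sigma w$ against $2\Sigma w$, which is just $2 w^T \Sigma$ written as a column vector since $\Sigma$ is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
Sigma = B @ B.T              # symmetric, like a covariance matrix
w = rng.standard_normal(n)
eps = 1e-6

def risk(w):
    return w @ Sigma @ w     # the quadratic form w^T Sigma w

# Finite-difference gradient, one weight at a time.
g = np.empty(n)
for i in range(n):
    dw = np.zeros(n)
    dw[i] = eps
    g[i] = (risk(w + dw) - risk(w)) / eps

print(np.allclose(g, 2 * Sigma @ w, atol=1e-3))  # True for symmetric Sigma
```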
