The optimization set: Gauss-Newton method

This log is an extension of the least-squares problems introduced in a previous log (The optimization set: QR factorization – ToothlessOS Log). Specifically, we will look at how to apply those results to non-linear regression problems.

Theory

Least-square problems

We recall that the least-squares solution is an approximate solution to an overdetermined linear system Ax = b, with the criterion of minimizing the squared norm of the residual r:

    \[\min || r ||^2 = || Ax - b ||^2 \]

For these problems, we can compute the approximate solution directly (assuming A has full column rank): \hat{x} = A^{\dagger}b = {(A^{T}A)}^{-1}A^{T}b.
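As a quick refresher, here is a minimal MATLAB sketch (with a made-up A and b) comparing the normal-equations formula above with MATLAB's backslash operator, which solves the same least-squares problem via a QR factorization and is numerically more stable:

% Minimal least-squares sketch (A and b are made-up examples)
A = [1 1; 1 2; 1 3; 1 4];   % overdetermined 4x2 system
b = [2.1; 3.9; 6.2; 7.8];

% Normal equations, matching the formula above
x_ne = (A'*A)\(A'*b);

% Backslash solves the same problem via QR internally
x_bs = A\b;

disp(norm(x_ne - x_bs));    % should be close to 0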

The Gauss-Newton method

Unfortunately, we cannot always formulate problems in this nice linear form. Therefore, we have to make some adjustments: introducing the Gauss-Newton method!

This sounds very similar to the Newton-Raphson solver mentioned in a previous log (The optimization set: Non-linear solvers – ToothlessOS Log). Indeed, the ideas behind them are closely related: both rely on a first-order (linear) approximation of a non-linear function.

The theoretical development here follows Prof. Peng Sun's slides for MATH304 at DKU (the slides themselves are not reproduced in this log).
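In brief, the idea is the following. Given a non-linear model f(x|\beta) with parameter vector \beta and data points (x_i, y_i), each Gauss-Newton iteration linearizes f around the current parameter estimate and solves a linear least-squares problem for the parameter update \Delta:

    \[Z_{ij} = \frac{\partial f(x_i)}{\partial \beta_j}, \qquad D_i = y_i - f(x_i|\beta)\]

    \[\min_{\Delta} || Z\Delta - D ||^2, \qquad \beta \leftarrow \beta + \Delta\]

These are exactly the matrix [Z_{j}] and the difference matrix D constructed in the practice section below, and each least-squares subproblem can be solved with the QR factorization from the earlier log.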

Practice

In the practice section, we will implement the Gauss-Newton method from scratch in MATLAB on China GDP data. This section is designed with reference to: Non Linear Regression Analysis | My Data Science Projects.

Step 1: Data preparation

For the dataset, we will use China GDP data from 1960 to 2014. The dataset can be downloaded here.

% Data preparation
data = readtable("china_gdp.csv");

% Preview
data
plot(data, "Year", "Value");

% Normalization
data.Year = data.Year / max(data.Year);
data.Value = data.Value / max(data.Value);
plot(data, "Year", "Value");

(Data preview)

(Normalized)

Step 2: Choose a model

Based on the growth trend of the data, we choose two non-linear models to fit it: the exponential function and the sigmoid/logistic function.

The exponential function:

    \[f(x|a,b,c) = a + bc^x\]

where b \ne 0, c > 0, c \ne 1, and x is any real number.

Now, in order to construct the matrix [Z_{j}], we want to compute the partial derivatives of f(x) with respect to a, b and c.

    \[\frac{\partial f(x)}{\partial a} = 1\]

    \[\frac{\partial f(x)}{\partial b} = c^x\]

    \[\frac{\partial f(x)}{\partial c} = bxc^{x-1}\]
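As a sketch of how these derivatives turn into the columns of Z (the anonymous-function names here are illustrative only; Step 3 below builds Z inline instead):

% Exponential model and its partial derivatives (illustrative names)
f_exp = @(x,a,b,c) a + b*c.^x;
df_da = @(x,a,b,c) ones(size(x));   % df/da = 1 for every data point
df_db = @(x,a,b,c) c.^x;            % df/db = c^x
df_dc = @(x,a,b,c) b.*x.*c.^(x-1);  % df/dc = b*x*c^(x-1)

% Each column of Z is one partial derivative evaluated at all x values:
% Z = [df_da(x,a,b,c), df_db(x,a,b,c), df_dc(x,a,b,c)];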

The sigmoid function:

    \[f(x|a,b,c,d) = a + \frac{b}{1+c^{x-d}}\]

Similarly, we need to find the partial derivatives:

    \[\frac{\partial f(x)}{\partial a} = 1\]

    \[\frac{\partial f(x)}{\partial b} = \frac{1}{1+c^{x-d}}\]

    \[\frac{\partial f(x)}{\partial c} = -b(1+c^{x-d})^{-2}(x-d)c^{x-d-1}\]

    \[\frac{\partial f(x)}{\partial d} = b(1+c^{x-d})^{-2}c^{x-d}\ln(c)\]
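Derivations like these are easy to get wrong, so it can be worth checking them numerically before plugging them into the solver. A minimal sketch (with arbitrary test values) comparing the \partial f / \partial c formula against a central finite difference:

% Check the sigmoid derivative w.r.t. c at arbitrary test values
a = 0; b = 1; c = 2; d = 0.5; x = 0.8; h = 1e-6;

f = @(c) a + b./(1 + c.^(x - d));                % sigmoid model as a function of c
dfdc = -b*(1 + c^(x-d))^(-2)*(x - d)*c^(x-d-1);  % analytic formula above
dfdc_fd = (f(c + h) - f(c - h)) / (2*h);         % central finite difference

disp(abs(dfdc - dfdc_fd));                       % should be tiny (~1e-10)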

Step 3: Implementation

When playing around with the implementation, I found that preprocessing the data is very important. If the data is not aligned with the target function, we can face horrible divergence and numerical-instability issues. This is especially important for exponential functions, which can explode to infinity!

Therefore, the implementation here is a bit different. We fix the parameters a = 0 and b = 1 and only update c, reducing the model to the simple exponential form f(x) = c^x. At the same time, we shift and scale the original dataset so that it passes through (0, 1), the point that c^x passes through for any c. After this, we are able to get better fitting results.

% Fitting with exponential functions
% Data preparation
data = readtable("china_gdp.csv");

% Preview
data
plot(data, "Year", "Value");

% Normalization (scale and shift)
data_norm = data;
data_norm.Year = (data.Year / max(data.Year) - 1) * 200;
data_norm.Value = data.Value / max(data.Value);
plot(data_norm, "Year", "Value");

% Case 1: Using exponential function
% Initialize the parameters
a = 0;
b = 1;
c = 1;

% Parameter update vector (one entry per parameter)
delta_A = zeros(3,1);

% Convergence threshold ("tol" rather than "eps", which shadows a MATLAB built-in)
tol = 1e-5;

% Set the max iterations here
for i=1:100
    % Jacobian(Derivatives) matrix
    % Vectorized implementation
    Z = zeros(length(data_norm.Year),3);
    % Partial derivative: a
    Z(:,1) = 1;
    % Partial derivative: b
    Z(:,2) = c.^data_norm.Year;
    % Partial derivative: c
    Z(:,3) = b.*data_norm.Year.*c.^(data_norm.Year-1);
    
    % The difference matrix
    D = zeros(length(data_norm.Year),1);
    D(:,1) = data_norm.Value - (a + b*c.^data_norm.Year);

    % Now, we can solve for the least-squares solution delta_A
    % This tells us how to update the parameters
    % Economy-size QR factorization: solve R*delta_A = Q'*D
    [Q, R] = qr(Z, 0);
    delta_A = R\(Q'*D);

    update = abs(delta_A(3)/c);

    % a = a + delta_A(1);
    % b = b + delta_A(2);
    c = c + delta_A(3);

    % Check convergence condition
    if update < tol
        disp("Convergence cond reach at " + i + " iteration")
        break
    end
end

disp("Update: " + update);
disp("Loss: " + norm(data_norm.Value - (a + b*c.^data_norm.Year)))
x_axis = linspace(-6,0,100);
y_axis = a + b*c.^x_axis;
plot(data_norm.Year, data_norm.Value, x_axis, y_axis);

Results: L2-Loss after normalization: 0.16822

(Visualization)

Unfortunately, we ran into divergence issues again with the sigmoid function.

% Data preparation
data = readtable("china_gdp.csv");

% Preview
data
plot(data, "Year", "Value");

% Normalization (scale and shift)
data_norm = data;
data_norm.Year = data.Year / max(data.Year);
data_norm.Value = data.Value / max(data.Value);
plot(data_norm, "Year", "Value");

% Case 2: Using the sigmoid function
% Initialize the parameters
a = 0;
b = 1;
c = exp(1);
d = 0;

% Parameter delta matrix
delta_A = zeros(4,1);

% Convergence threshold
tol = 1e-5;

% Set the max iterations here
for i=1:1000
    % Jacobian(Derivatives) matrix
    % Vectorized implementation
    Z = zeros(length(data_norm.Year),4);
    % Partial derivative: a
    Z(:,1) = 1;
    % Partial derivative: b
    Z(:,2) = 1./(1+c.^(data_norm.Year-d));
    % Partial derivative: c
    Z(:,3) = -(b.*(data_norm.Year-d).*c.^(data_norm.Year-d-1))./(1+c.^(data_norm.Year-d)).^2;
    % Partial derivative: d
    Z(:,4) = (b.*c.^(data_norm.Year-d).*log(c))./(1+c.^(data_norm.Year-d)).^2;
    
    % The difference matrix
    D = zeros(length(data_norm.Year),1);
    D(:,1) = data_norm.Value - (a + b./(1+c.^(data_norm.Year-d)));

    % Now, we can solve for the least-squares solution delta_A
    % This tells us how to update the parameters
    % Economy-size QR factorization: solve R*delta_A = Q'*D
    [Q, R] = qr(Z, 0);
    delta_A = R\(Q'*D);

    % Relative step size; the per-parameter version (delta_A(i)/param) would
    % divide by a and d, which start at 0 and give Inf
    update = norm(delta_A) / norm([a b c d]);

    a = a + delta_A(1);
    b = b + delta_A(2);
    c = c + delta_A(3);
    d = d + delta_A(4);

    % Check convergence condition
    if update < tol
        disp("Convergence cond reach at " + i + " iteration")
        break
    end
end

disp("Update: " + update);
disp("Loss: " + norm(data_norm.Value - (a + b/(1+c.^(data_norm.Year-d)))));
x_axis = linspace(0.97,1,100);
y_axis = a + b./(1+c.^(x_axis-d));
plot(data_norm.Year, data_norm.Value, x_axis, y_axis);

legend(["original", "fit"])

The divergence issue probably lies in the hidden assumptions of the Gauss-Newton method. For this method to work, the first-order Taylor approximation must remain accurate near the current parameters. Otherwise, it is possible that we get completely lost!
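One common remedy (not part of the implementation above; this is a hypothetical sketch) is to damp the update: if the full Gauss-Newton step would increase the residual, shrink it until it does not. This simple idea is one step towards Levenberg-Marquardt-style methods:

% Hypothetical damped update, assuming a column vector params and a
% function handle r(params) that returns the residual vector
step = 1;
while step > 1e-8
    trial = params + step*delta_A;
    if norm(r(trial)) < norm(r(params))
        params = trial;   % accept the damped step
        break
    end
    step = step/2;        % otherwise halve the step and retry
end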

