ECE 8550: Assignment 1
1 Linear regression
In this section, you will implement linear regression to predict the death rate from several features, including annual precipitation, temperature, population, income, pollution, etc. The data for this assignment is given in the file data.txt, which contains a description of the data followed by a data matrix with 17 columns and 60 rows (examples). In the data matrix, columns 2-16 are the input features, and column 17 is the output target. Note that column 1 is the index and should not be used in the regression. You will implement gradient descent to learn the linear regression model.
We split the data into a training set and a testing set following the 80/20 rule. Thus, we have 48 training samples (n = 48) with 16 features (d = 16). The i-th sample is denoted by x_i, where x_i^j is the value of the j-th feature of this sample.
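For concreteness, a minimal numpy sketch for loading and splitting might look as follows. The comments="#" argument is an assumption about how the description block at the top of data.txt is marked; adjust it (or use skiprows) to match the actual file.

    import numpy as np

    # Load the data matrix; the description block is assumed to consist of
    # comment lines that numpy can skip (an assumption; adjust as needed).
    data = np.loadtxt("data.txt", comments="#")

    X = data[:, 1:-1]   # the columns between the index and the target
    y = data[:, -1]     # column 17: the output target

    # 80/20 split; the data is already shuffled, so take the first rows.
    n_train = int(0.8 * data.shape[0])          # 48 of the 60 examples
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]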
1.1 Feature normalization
By looking at the feature values in the data, you may notice that some features are about 1000 times larger than others. When features differ by orders of magnitude, performing feature normalization first can make gradient descent converge much more quickly. There are different ways to do feature normalization. For simplicity, we use the following method: (1) subtract the minimum value of each feature; (2) divide the feature values by their range (maximum minus minimum). Equivalently, it is given by the following equation:
for each feature X, let X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}.
Similarly normalize the target Y. Note that to make a prediction for a new data point, you also need to first normalize its features, in the same way as you did for the training dataset.
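A minimal sketch of this normalization, under one reading of the note above (statistics computed on the training set and then reused for any new point), continuing the variables from the previous sketch:

    # Min/max are computed once on the training set and reused, so a new
    # point is normalized exactly like the training data. (Computing the
    # statistics over all 60 examples is another defensible reading.)
    x_min = X_train.min(axis=0)
    x_rng = X_train.max(axis=0) - x_min
    X_train_n = (X_train - x_min) / x_rng
    X_test_n = (X_test - x_min) / x_rng

    y_min = y_train.min()
    y_rng = y_train.max() - y_min
    y_train_n = (y_train - y_min) / y_rng
    y_test_n = (y_test - y_min) / y_rng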
1.2 Feature appending
Classic regression is defined as
f(x_i) = w \cdot x_i + b = \sum_{j=1}^{d} w^j \times x_i^j + b,

where b is the intercept. For simplicity, we let b = w^{d+1} \times x_i^{d+1}, where x_i^{d+1} = 1 and w^{d+1} is an unknown but learnable parameter. Thus, we can combine w and b by letting

w ← [w^1, w^2, w^3, \ldots, w^d, w^{d+1}]^T;    x_i ← [x_i^1, x_i^2, x_i^3, \ldots, x_i^d, 1]^T.
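In numpy, the appended constant feature is one hstack per split, continuing the earlier sketch:

    # Append the constant feature x_i^{d+1} = 1 to every sample, so the
    # intercept b becomes the last entry w^{d+1} of w.
    X_train_a = np.hstack([X_train_n, np.ones((X_train_n.shape[0], 1))])
    X_test_a = np.hstack([X_test_n, np.ones((X_test_n.shape[0], 1))])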
1.3 Gradient descent with ridge regularization
To recap, the prediction function for linear regression is given by
f(x_i) = w \cdot x_i = w^T x_i = \sum_{j=1}^{d+1} w^j \times x_i^j, (1)
where
• x_i is the feature vector of the i-th sample;
• x_i^j is the value of the j-th feature for the i-th sample;
• w^j is the j-th parameter of w.
The loss function for linear regression with ridge regularization is given by
J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \frac{\lambda}{2(d+1)} \sum_{j=1}^{d+1} (w^j)^2. (2)
To minimize this function, we first obtain the gradient with respect to each parameter w^j as:

\frac{\partial J(w)}{\partial w^j} = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right) x_i^j + \frac{\lambda}{d+1} w^j. (3)
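A vectorized numpy sketch of Eqs. (2) and (3), assuming X is the matrix of appended samples stacked row-wise, so that X.shape[1] equals d + 1:

    def ridge_loss(w, X, y, lam):
        """Eq. (2): averaged squared loss plus the ridge penalty."""
        r = X @ w - y                        # residuals f(x_i) - y_i
        return r @ r / (2 * len(y)) + lam * (w @ w) / (2 * X.shape[1])

    def ridge_grad(w, X, y, lam):
        """Eq. (3): gradient of Eq. (2), all d+1 coordinates at once."""
        r = X @ w - y
        return X.T @ r / len(y) + lam * w / X.shape[1]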
Then, the (full) gradient descent algorithm is given as:
Algorithm 1: Batch Gradient Descent
k = 0;
while convergence criteria not reached do
    for j = 1, . . . , d + 1 do
        Update w^j ← w^j − α ∂J(w)/∂w^j;
    Update k ← k + 1;

where α is the learning rate. The convergence criterion for the above algorithm is ∆%cost < ε, where

∆%cost = \frac{|J_{k-1}(w) - J_k(w)|}{J_{k-1}(w)} \times 100,

where J_k(w) is the value of Eq. (2) at the k-th iteration, and ∆%cost is computed at the end of each iteration of the while loop. Initialize w = 0 and compute J_0(w) with these values.
Important: You must simultaneously update w^j for all j.
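A sketch of Algorithm 1 built on the two helpers above; the single vectorized step updates every w^j simultaneously, and the defaults for α and ε match the values given in Section 1.5:

    def batch_gd(X, y, lam, alpha=0.01, eps=0.001):
        w = np.zeros(X.shape[1])                    # initialize w = 0
        costs = [ridge_loss(w, X, y, lam)]          # J_0(w)
        while True:
            # one vectorized step: updates all w^j simultaneously
            w = w - alpha * ridge_grad(w, X, y, lam)
            costs.append(ridge_loss(w, X, y, lam))
            pct = abs(costs[-2] - costs[-1]) * 100 / costs[-2]
            if pct < eps:                           # Delta%cost < epsilon
                break
        return w, costs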
Task: Load the dataset in data.txt and split it into 80% / 20% training/test. The dataset is already shuffled, so you can simply use the first 80% of the examples for training and the remaining 20% for testing. Learn the linear regression model using the training data. Plot the value of the loss function J_k(w) vs. the number of iterations (k). For each iteration, also compute the mean squared loss (without the regularization term) on the test data, and plot it vs. the number of iterations (k) in the same figure. The mean squared loss is defined as
MSE = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - w \cdot x_i \right)^2,

where m is the number of test examples (m = 12 here).
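One way to obtain both curves is to repeat the training loop while also evaluating a small test-loss helper after every update; the variable names (X_train_a, y_train_n, and so on) are those of the earlier sketches, and λ = 2 is the ridge value specified in Section 1.5:

    import matplotlib.pyplot as plt

    def mse(w, X, y):
        """Mean squared loss over m examples, with the same 1/(2m) factor."""
        r = y - X @ w
        return r @ r / (2 * len(y))

    # Rerun the loop, this time also recording the test loss per iteration.
    w = np.zeros(X_train_a.shape[1])
    costs = [ridge_loss(w, X_train_a, y_train_n, 2.0)]
    test_mses = [mse(w, X_test_a, y_test_n)]
    while True:
        w = w - 0.01 * ridge_grad(w, X_train_a, y_train_n, 2.0)
        costs.append(ridge_loss(w, X_train_a, y_train_n, 2.0))
        test_mses.append(mse(w, X_test_a, y_test_n))
        if abs(costs[-2] - costs[-1]) * 100 / costs[-2] < 0.001:
            break

    plt.plot(costs, label="training loss J_k(w)")
    plt.plot(test_mses, label="test MSE")
    plt.xlabel("iteration k")
    plt.legend()
    plt.show()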
1.4 Gradient descent with lasso regularization
The loss function for linear regression with lasso regularization is given by
J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \frac{\lambda}{2(d+1)} \sum_{j=1}^{d+1} |w^j|. (4)
To minimize the loss function, you will need to derive the gradient yourself. The gradient descent algorithm is the same as above.
Hint: For simplicity, you can consider \frac{\partial |w^j|}{\partial w^j} = 1 when w^j = 0.
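The derivation itself is part of what you submit (see Section 1.5, item 4); once you have it, a numpy sketch of the resulting (sub)gradient, with the hint applied at w^j = 0, might look like:

    def lasso_grad(w, X, y, lam):
        """A (sub)gradient of Eq. (4), using the hint at w^j = 0."""
        r = X @ w - y
        sgn = np.where(w >= 0, 1.0, -1.0)   # np.sign would return 0 at 0
        return X.T @ r / len(y) + lam * sgn / (2 * X.shape[1])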
Task: Load the dataset in data.txt and split it into 80% / 20% training/test. Learn the lasso regression model using the training data. Plot the value of the loss function J_k(w) vs. the number of iterations (k). Plot the mean squared loss over the test data vs. the number of iterations (k) in the same figure.
1.5 What to submit
In this assignment, use α = 0.01 and ε = 0.001. Use λ = 2 for ridge regularization, and use λ = 0.3 for lasso regularization.
1. (50 pts) The source code (.py), which must run without errors in Anaconda 2021.11 (Python 3.9).
2. (20 pts) The plot of the loss function J_k(w) and the MSE vs. the number of iterations (k) for Section 1.3, and the squared loss on the test data for Section 1.3.
3. (10 pts) The number of elements in w whose absolute value is smaller than 0.01, for Section 1.3 (see the counting snippet after this list).
4. (10 pts) The equation for the gradient of Eq. (4), which should be similar to Eq. (3).
5. (10 pts) The plot of the loss function J_k(w) and the MSE vs. the number of iterations (k) for Section 1.4, and the squared loss on the test data for Section 1.4.
6. (10 pts) The number of elements in w whose absolute value is smaller than 0.01, for Section 1.4.
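The count asked for in items 3 and 6 is a one-liner over the learned weight vector:

    # Number of elements of w with absolute value below 0.01.
    num_small = int(np.sum(np.abs(w) < 0.01))
    print(num_small)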
1.6 Requirements
1. The assignment must be implemented using Python in Anaconda 2021.11 (Python 3.9).
2. No libraries are allowed except Python built-ins, numpy, and matplotlib.
3. Figures, equations, and numbers should be in a PDF file.
4. Zip the Python source code (.py) and your PDF file before uploading.