SUMMATIVE ASSIGNMENT 2 – BUSI4AY15
Business Analytics
Masters Programmes 2024/25
For this assignment, you will be provided a data set in Excel and an Excel answer sheet. At the bottom of this assignment, you will find a list of exercises to execute on the data set provided. You are to input your answers into your Excel answer sheet and submit it on Blackboard.
SUBMISSION INSTRUCTIONS
A penalty will be applied for work uploaded after 11:59am as detailed in the Late submission policy. You must leave sufficient time to fully complete the upload process before the deadline and check that you have received a receipt. At peak periods, it can take up to 30 minutes for a receipt to be generated.
FORMAT
You are to submit the Excel answer sheet with the file name adequately changed as instructed. You are not to alter the structure of the Excel answer sheet. The Excel file should be kept in the .xlsx format.
MARKING GUIDELINES
The number of marks carried by each question in the exercise is indicated clearly in this assignment.
PLAGIARISM AND COLLUSION
Note that your data set is unique to you, correspondingly, the answers that you will obtain will also be unique to you.
Students suspected of plagiarism, either of published work or the work of other students, or of collusion will be dealt with according to School and University guidelines.
SPECIFIC INSTRUCTIONS
1. You will be able to find your data set for this assignment in the “Data Sets” folder. All of the files in this folder are named “yourZnumber_Number1_Number2.xlsx”. You are to find the data set corresponding to your Z number. Download this data set.
2. You are strongly advised to save a copy of the data set in a safe location, in the unlikely but potential event that you accidentally overwrite the data during the process of your analysis.
3. Note that all of your colleagues have been provided different datasets, and consequently will arrive at different correct answers for the assignment. As such, please ensure that you use the data set that corresponds to your Z number and not the data sets of any of your colleagues.
4. In the assignment folder, you will also find an Excel file that is labelled “yourZnumber_SA2.xlsx”. This is your answer sheet, which you use to fill your responses in for each question and which you will submit on Blackboard. Change yourZnumber in the file name to your Z number right now.
5. As your assignment is machine-graded, any minute error in your file name will render it unreadable by the machine and will lead to a complete loss of marks, so please ensure that the file name is correct.
6. The Excel answer file contains two columns. The first column lists the question numbers to which an answer is expected in the Excel answer file. In the second column, you are to key in your answer to the corresponding question there. Do not alter the structure of the Excel file, namely, do not add new rows or columns or key in any values outside of the demarcated area.
7. You will be required to key different types of answers in a specific form.
a. For numerical answers, please leave your answers with at least 3 significant figures (in other words, with at least 3 non-zero digits, e.g. you may reflect “12.345” as “12.3” and “0.01234” as “0.0123”). If your numerical answer is a whole number, leave them as such (e.g. you may leave “3” as “3” as opposed to “3.00” that is in 3 significant figures).
b. If you are required to report probabilities, for example, p-values, if the numerical value is smaller than 0.0001 or 1e-4, please report the value as 0. If you are asked to report probabilities or proportion or percentages, please reflect the value in decimals (e.g. report 78% as 0.78). Never reflect your answer as a fraction (e.g. for 1/3, instead use 0.333 in 3 significant figures).
c. For multiple choice questions in this assignment, feel free to leave your responses in any of large or small capitals.
8. Wrong answers have a potential to carry partial marks.
9. This assignment comprises a total of 43 questions, amounting to a total of 100 points.
ASSIGNMENT QUESTIONS
You will be able to download your data set from Blackboard. Your data set will contain a total of 600 data points. Please split the data set into a training and testing data set. The training data set will comprise the first 500 data points and the testing data set will comprise the last 100 data points. Unless explicitly stated that it is for testing, the training data set should always be used.
You are strongly advised to save a copy of the data set in a safe location, in the unlikely but potential event that you accidentally overwrite the data during the process of your analysis.
In this data frame, you will find 10 columns. The column ‘y’ is to be assumed as the outcome / dependent variable. The other columns labelled from ‘x1’ to ‘x9’ are to be assumed as the predictors / independent variables.
To verify that you have downloaded the correct data set and split the training and testing data correctly, please calculate the mean and variance of y for both the training and testing data sets. In the file name of your data set, you see the following “yourZnumber_Number1_Number2_Number3_Number4.xlsx”. Numbers 1 and 2 would be the mean and variance for y in the training data set; and numbers 3 and 4 are the mean and variances for the testing data set.
Do not proceed if these numbers are not correct! Verify with your tutor if they are incorrect.
For this assignment, unless otherwise stated, we will use a significance level of 5%.
Question 1
1 point
Find the mean of ‘y’.
Question 2
3 points
Compute the correlations of all 9 predictors ‘x1’ to ‘x9’ against the outcome ‘y’. Which of these predictors has the strongest correlation with ‘y’? Your answer should be in the format: ‘x<a number>’, e.g. ‘x1’.
Question 3
2 points
What is the consequence of answering Question 2 in the context of building a linear regression model?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A Predictors with a high correlation would indicate multi-collinearity and should not be included as predictors in the linear regression model.
B Predictors with a high correlation would indicate multi-collinearity and should be included as predictors in the linear regression model.
C Predictors with a low correlation would indicate a weak relationship with the outcome ‘y’ and should not be included as predictors in the linear regression model.
D Predictors with a low correlation would indicate a weak linear relationship with the outcome ‘y’ and should still be included as predictors in the linear regression model, as we cannot rule out deeper non-linear relationships with the outcome.
Question 4
3 points
Build a linear regression model with ‘y’ as the outcome and all of the ‘x’ variables as the predictors. Let us call this Model A. Report the adjusted R-squared value.
Question 5
1 point
What percentage of the variance in ‘y’ is not explained by your current linear regression model in Question 4?
Question 6
1 point
What is the p-value for the variable ‘x4’ in your linear regression model?
Question 7
2 points
Depending on your answer for Question 6, what can you conclude about whether or not there is a relationship between the outcome variable ‘y’ and the feature ‘x4’?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A We do not have enough data or evidence to say that there is a relationship between outcome ‘y’ and the feature ‘x4’.
B We do not have enough evidence to say that there is definitely no relationship between outcome ‘y’ and the feature ‘x4’, but we can at least say that we do not have enough data to conclude that this relationship is linear.
C There is a linear relationship between outcome ‘y’ and feature ‘x4’.
D There is a relationship between outcome ‘y’ and feature ‘x4’, but we do not yet have enough data to conclude if this relationship is linear.
Question 8
3 points
Build a linear regression model with ‘y’ as the outcome, but with only the top 5 ‘x’ variables with the highest magnitude of correlation with ‘y’. Let us call this Model B. Report the adjusted R-squared value.
Question 9
2 points
Run a test that verifies if the R-squared value for Model B is significantly different from the R-squared value for Model A. Report the associated p-value for this test.
Question 10
2 points
Depending on your answer for Question 9, what can you conclude about whether or not it is a good idea to omit the ‘x’ variables with small magnitude of correlation with the outcome ‘y’?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A The p-value is small, thus indicating that there is no significant difference between the models. As such, we are justified in omitting the ‘x’ variables.
B The p-value is small, thus indicating that there is a significant difference between the models. As such, we are not justified in omitting the ‘x’ variables.
C The p-value is large, thus indicating that there is no significant difference between the models. As such, we are justified in omitting the ‘x’ variables.
D The p-value is large, thus indicating that there is no significant difference between the models. However, we should not rush to the conclusion that these ‘x’ variables play no part in the outcome ‘y’.
Question 11
1 point
You would notice that there is an ‘x’ variable that has a negative correlation with the outcome ‘y’ and with magnitude at least 0.2. Which variable is that? Your answer should be in the format: ‘x<a number>’, e.g. ‘x1’.
Question 12
3 points
Using your analysis from Model A, plot a scatter plot of the residuals against the variable you identified in Question 11. What can you conclude from the plot?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A There are non-linearities in the model.
B There is multi-collinearity in the model.
C There is an interaction effect involving this variable in the model.
D None of the above assumptions of linear regression has been violated.
Question 13
3 points
In your data set, create a new variable that is the exponential of the variable you identified in Question 11. Run a linear regression model with all the ‘x’ variables plus this variable. Let us call this Model C. Report the adjusted R-squared value.
Question 14
1 point
Report the p-value for this new variable in Model C.
Question 15
2 points
Run a test that verifies if the R-squared value for Model C is significantly different from the R-squared value for Model A. Report the associated p-value for this test.
Question 16
1 point
Report the p-value for the variable you identified in Question 11 in Model C.
Question 17
1 point
Based on your answers from Questions 14 to 16, what can you conclude?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A Adding the new variable led to a significantly better model. However, we should not keep the original variable from Question 11, but keep only the new variable in the regression model.
B Adding the new variable led to a significantly better model. Moreover, both the original variable from Question 11 and the new variable should be retained in the regression model.
C Adding the new variable did not lead to a significantly better model. However, we should keep the new variable in the regression model.
D Adding the new variable did not lead to a significantly better model. Moreover, we should delete the variable in Question 11 and the new variable from the regression model.
Question 18
3 points
Plot two scatter plots as follows:
(I) The outcome ‘y’ against the predictor ‘x9’ only for the data points where ‘x1’ is larger or equal to 0
(II) The outcome ‘y’ against the predictor ‘x9’ only for the data points where ‘x1’ is less than 0
Compare the two scatter plots. What can you conclude from them?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A There are non-linearities in the model.
B There is multi-collinearity in the model.
C There is an interaction effect involving this variable in the model.
D None of the above assumptions of linear regression has been violated.
Question 19
5 points
Create a new variable that addresses the problem in Question 18. Now, modify Model C to include this new variable. Let us call this Model D. Report the adjusted R-squared value.
Question 20
2 points
Run a test that verifies if the R-squared value for Model D is significantly different from the R-squared value for Model C. Report the associated p-value for this test.
Question 21
3 points
Compute the correlations of all the other 8 predictors against the predictor ‘x8’. Which of these predictors has the strongest correlation with ‘x8’? Your answer should be in the format: ‘x<a number>’, e.g. ‘x1’.
Question 22
1 point
What might Question 21 be indicative of?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A There are non-linearities in the model.
B There is multi-collinearity in the model.
C There is an interaction effect involving this variable in the model.
D None of the above assumptions of linear regression has been violated.
Question 23
5 points
Using a more precise test than correlations, re-perform. the analysis in Question 21 using Model D. If the cut-off for the test is set at 5, how many variables should be omitted from the model?
Question 24
4 points
Using your analysis from Question 23, modify Model D to arrive at a new model, but keeping variable ‘x8’. Let us call this Model E. At this point, remove all non-significant variables from Model E. You should be left with 5 predictors. Report the adjusted R-squared value.
Question 25
2 points
Run a test to verify that Model E is indeed better than Model D. What can you conclude from your test?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A Removing the variables led to a significantly worse model, because the p-value of the test was significant.
B Removing the variables led to a significantly worse model, because the p-value of the test was not significant.
C Removing the variables did not lead to a significantly better model, because the p-value of the test was significant.
D Removing the variables did not lead to a significantly better model, because the p-value of the test was not significant.
Question 26
2 points
On the training data set, calculate the root mean squared error for Model E.
Question 27
2 points
On the testing data set, calculate the root mean squared error for Model E.
Checkpoint
Before proceeding onto the next part of this exercise, verify that you are now indeed left with just 5 predictors in Model E. These might not be ‘x1’ to ‘x9’ and might be some transformation of them. If you are not left with 5 predictors, you might want to verify your working. Additionally, you might want to remove any non-significant variables that are still left in Model E.
Copy the 5 predictors in Model E into a new sheet. To this new sheet, create a new variable that is 1 if ‘y’ is greater or equal to 0 and 0 if ‘y’ is less than 0. From now on, we will refer to this new binary variable as ‘z’, and it will be treated as the outcome variable.
Question 28
3 points
Build a logistic regression model with ‘z’ as the outcome variable and these 5 predictors. Call this Model F. Report the AUC.
Question 29
2 points
What is the significance of the value in Question 28?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A It represents the correlation between the outcome ‘z’ and 5 predictors, so the larger the value the better the model.
B It should not be less than 0.5, otherwise the model is no better than predicting every point as yes or no.
C It is an objective measure of the goodness of the model without having to resort to a testing data set; a high value will guarantee good performance even in a testing data set.
D A high value shows that there is a high trade-off between sensitivity and specificity, indicating that the model is predicting poorly.
Question 30
1 point
Set the cut-off probability to 0.5. What is the in-sample accuracy of Model F?
Question 31
1 point
Set the cut-off probability to 0.5. What is the out-of-sample accuracy of Model F?
Question 32
2 points
Based on your answer to Questions 30 and 31, what can you conclude about Model F?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A Model F seems to generalize out-of-sample as its out-of-sample accuracy is comparable to its in-sample accuracy.
B Model F does not generalize out-of-sample as its out-of-sample accuracy is not comparable to its in-sample accuracy.
C Model F seems to generalize out-of-sample as its out-of-sample accuracy is higher than its in-sample accuracy.
D Model F does not generalize out-of-sample as its out-of-sample accuracy is lower than its in-sample accuracy.
Question 33
1 point
Set the cut-off probability to 0.5. What is the in-sample recall of Model F?
Question 34
1 point
Set the cut-off probability to 0.5. What is the in-sample specificity of Model F?
Question 35
2 points
If the cut-off probability is changed to 0.4, how do we expect the answers to Questions 33 and 34 to change?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A The recall will increase and the specificity will decrease.
B The recall will increase and the specificity will increase.
C The recall will decrease and the specificity will decrease.
D The recall will decrease and the specificity will increase.
Question 36
5 points
In this example, we verify what happens if we use Model E to solve the classification problem of whether or not outcome ‘y’ is greater or equal to 0.
With Model E, form. predictions according to your final model with the same 5 predictors as in Model F on the testing data set. If the outcome or prediction was greater or equal to 0, then we record that as a positive, and vice versa. Based on this, form. a confusion matrix for Model E.
Report the out-of-sample accuracy for Model E.
Question 37
3 points
Compare your answers between Questions 31 and 36. What can you conclude?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A The accuracy for the linear regression model is higher because it is forms continuous predictions, which enables the decision maker to better draw the line dividing the regions where outcome ‘y’ is larger than 0 or not.
B The accuracy for the logistic regression model is higher because it solves the simpler problem of simply deciding if outcome ‘y’ is larger than 0 or not, rather than predicting the actual value of ‘y’. Thus, it can afford to be more precise.
C There is numerically a difference, but this difference is not significant, as both models use the same underlying linear structure to form. predictions and should therefore arrive at similar results.
D There is no reasonable way to conclude on the differences between these two models; it depends on how the predictors relate to the outcome.
Question 38
2 points
Build a classification tree model with ‘z’ as the outcome variable and the same 5 predictors. Call this Model G. Use default parameters of depth = 4, cp = 0.01, minsplit = 20, minbucket = 5. Report the out-of-sample accuracy.
Question 39
3 points
Compare your answers between Questions 31 and 38. What can you conclude?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A We can conclusively decide which model is better based on the accuracy, because this is an out-of-sample statistic.
B Even if the regression tree model posted a higher accuracy, it is too hasty to conclude that it has a better performance than the logistic regression model as one could potentially shift the cut-off probabilities in the latter to arrive at a better model.
C Even if the regression tree model posted a lower accuracy, it is too hasty to conclude that it has a poorer performance than the logistic regression model as one could still change the parameters of the tree.
D It is unreasonable to compare the accuracies as they arise from different models.
Question 40
2 points
Find the leaf that has a proportion of class ‘1’ that is closest to 50%. Report this proportion.
Question 41
3 points
What is the significance of your answer to Question 39?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A This proportion forms our benchmark. Our out-of-sample accuracy should minimally beat this proportion.
B If this proportion is very close to 50%, then we should reduce the depth of the tree so that the data points in this leaf can be diluted with other leaves.
C If this proportion is very far away from 50%, then this might be a sign of overfitting in the tree.
D If this proportion is very close to 50%, then it indicates the presence of leaves where the model is very much undecided on the correct outcome and might indicate poor performance.
Question 42
5 points
Build a hierarchical clustering model with the 5 predictors in Models E, F and G using the training data set. Use the setting of 4 clusters.
Using only training data points in the largest cluster, build a logistic regression model with the 5 predictors and outcome ‘z’. Call this Model H. Report the AUC.
Question 43
3 points
How would you expect your answer to Question 42 to differ from Question 28?
Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.
A The AUC in Model H is higher than the AUC in Model F, because it has fewer data points.
B The AUC in Model H is higher than the AUC in Model F, because the data points used in Model H belong to the same cluster, thus exhibit a more consistent relationship.
C The AUC in Model H is lower than the AUC in Model F, because it has fewer data points.
D The AUC in Model H is lower than the AUC in Model F, because the way data points are selected into Model H is not based on the outcome variable ‘z’.