Assignment 2: 161.762
Multivariate Statistics for Big Data
Due: 2nd June 2025
Note: Code used for each analysis must be included in your response.
1) [10 marks] Multidimensional wine
The Wine dataset contains the results of a chemical analysis of wine samples grown in the same region of Italy. While all samples come from a single region, they originate from three distinct grape cultivars. The dataset includes various chemical properties used to assess the composition and quality of the wines.
The measured variables are:
a. Alcohol
b. Malic acid
c. Ash
d. Alkalinity of ash
e. Magnesium
f. Total phenols
g. Flavonoids
h. Nonflavonoid phenols
i. Proanthocyanins
j. Color intensity
k. Hue
l. OD280/OD315 of diluted wines
m. Proline
Note: Do not use the cultivar information (do not use the column "Class") in your analysis for this item.
A) [3 marks] Cluster Analysis
Perform a cluster analysis. What method did you select? Justify your choice. (Maximum: 100 words).
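As an illustration of what a partitioning method does, the following is a minimal Python sketch of k-means (Lloyd's algorithm). It is one option among several, and hierarchical or model-based methods may be equally defensible. The function name `kmeans`, the choice of the first k points as starting centroids, and the toy coordinates are all assumptions made for this example; none of the values come from the Wine dataset.

```python
import math

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm. Starting centroids are simply the first k points,
    which is a simplification chosen to keep the example deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # no assignment changed, so stop
            break
        labels = new_labels
    return labels, centroids

# Hypothetical two-variable toy data forming three well-separated groups.
points = [[0, 0], [5, 5], [10, 0], [0.2, 0.1], [5.1, 4.9], [9.8, 0.3]]
labels, centroids = kmeans(points, k=3)
print(labels)
```

With real data the result depends on the starting centroids and on whether the variables were standardised first, so both choices should be reported.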
B) [3 marks] Metric Multidimensional Scaling (MDS)
Perform a metric MDS analysis. Did you apply any preprocessing to the dataset? What distance metric did you choose? Justify your decisions. (Maximum: 100 words).
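The preprocessing and distance questions can be made concrete with a short Python sketch. It z-scores each variable and then builds a Euclidean distance matrix, which is the kind of input a metric MDS routine expects. Standardisation and Euclidean distance are assumptions chosen for illustration, not a prescribed answer. The helper names and the toy values are hypothetical and are not taken from the Wine dataset.

```python
import math
import statistics

def standardise(rows):
    """Return a z-scored copy of rows (a list of equal-length numeric lists)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample SD (n - 1)
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def euclidean_distance_matrix(rows):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy rows with two variables on very different scales.
toy = [[14.2, 1065.0], [13.2, 1050.0], [12.4, 520.0], [12.3, 680.0]]
z = standardise(toy)
d = euclidean_distance_matrix(z)
```

Without the standardisation step, the second variable would dominate every distance simply because its values are numerically larger.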
C) [4 marks] Comparison to Cultivars
Do your analyses (A and B) support the existence of three distinct cultivars? Why or why not? Provide a critical interpretation. (Maximum: 200 words).
2) [10 marks] Wine testing
As wine growers are interested in distinguishing between cultivars, they want to know whether it is possible to reliably discriminate between them based on their chemical composition. Carry out a Discriminant Analysis by following the steps below. For each question, justify your answers.
A) [3 marks] Canonical Discriminant Analysis
Split the dataset into training and testing sets. Perform an appropriate Discriminant Analysis using all variables. Briefly interpret the canonical functions and the separation achieved between cultivars. (Maximum: 200 words).
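One way to split the data is a stratified random split, which keeps roughly the same proportion of each cultivar in both subsets. The Python sketch below assumes a 70/30 split and a fixed seed. Both values, the function name `stratified_split`, and the toy labels are illustrative choices rather than requirements of the assignment.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=1):
    """Return (train_indices, test_indices), sampling within each class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical class labels: ten samples from each of three classes.
toy_labels = [1] * 10 + [2] * 10 + [3] * 10
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Reporting the seed alongside the results lets a reader reproduce the same training and testing subsets.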
B) [1 mark] Classification Accuracy
Apply the discriminant model to the testing subset. What is the overall classification error rate? Based on this result, would you trust the model’s performance? Briefly explain. (Maximum: 100 words).
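The overall classification error rate is the proportion of test samples whose predicted class differs from the true class. The Python sketch below computes it alongside the confusion counts. The function names are hypothetical, and the label vectors are made up for illustration; they are not results from any model fitted to the Wine dataset.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Proportion of samples whose predicted class is wrong."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Made-up labels for ten test samples: two of them are misclassified.
actual = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
predicted = [1, 1, 2, 2, 2, 2, 3, 3, 3, 1]

counts = confusion_counts(actual, predicted)
rate = error_rate(actual, predicted)
print(rate)  # 0.2
```

The confusion counts show which classes are confused with each other, which a single error rate cannot reveal.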
C) [2 marks] Stepwise Discriminant Analysis
Conduct a Stepwise Discriminant Analysis on the training set. How many variables are selected, and which ones? Briefly describe the procedure. (Maximum: 100 words).
D) [4 marks] DA with Selected Variables
Using only the subset of variables selected in (C), repeat the Discriminant Analysis on the training set and test it on the testing set. Compare the results of the full-variable and stepwise-variable analyses. Which model performs better in terms of error rate or interpretability? Which one would you recommend for future use? Briefly explain. (Maximum: 200 words).