Assignment 2: 161.762

Multivariate Statistics for Big Data

Due: 2nd June 2025

Note: Code used for each analysis must be included in your response.

1) [10 marks]  Multidimensional wine

The Wine dataset contains the results of a chemical analysis of wine samples grown in the same region of Italy. While all samples come from a single region, they originate from three distinct grape cultivars. The dataset includes various chemical properties used to assess the composition and quality of the wines.

The measured variables are:

a. Alcohol

b. Malic acid

c. Ash

d. Alkalinity of ash

e. Magnesium

f. Total phenols

g. Flavonoids

h. Nonflavonoid phenols

i. Proanthocyanins

j. Color intensity

k. Hue

l. OD280/OD315 of diluted wines

m. Proline

Note: Do not use the cultivar information (do not use the column "Class") in your analysis for this item.

A) [3 marks] Cluster Analysis

Perform a cluster analysis. What method did you select? Justify your choice. (Maximum: 100 words).
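As an illustration of one possible workflow (not the required method), the sketch below standardises the variables and runs a plain k-means. The data are a hypothetical synthetic stand-in for the 13 chemical variables, and the function names `standardize` and `kmeans` are made up for this example. The choice of k = 3 is an assumption that would itself need justifying.

```python
import numpy as np

def standardize(X):
    """Centre each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster becomes empty.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Hypothetical stand-in data: 90 rows, 13 columns on very different scales.
rng = np.random.default_rng(1)
centres = rng.normal(scale=3.0, size=(3, 13))
X = np.vstack([c + rng.normal(size=(30, 13)) for c in centres])
X = X * np.linspace(1.0, 1000.0, 13)  # mimic variables measured in different units

Z = standardize(X)
labels, centroids = kmeans(Z, k=3)
print(np.bincount(labels, minlength=3))  # cluster sizes
```

Standardising first matters here because a variable with a large numeric range would otherwise dominate the Euclidean distances. A hierarchical method (for example Ward linkage) would be an equally defensible choice.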

B) [3 marks] Metric Multidimensional Scaling (MDS)

Perform a metric MDS analysis. Did you apply any preprocessing to the dataset? What distance metric did you choose? Justify your decisions. (Maximum: 100 words).
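A minimal sketch of classical (Torgerson) metric MDS follows, assuming standardised variables and Euclidean distance. Both are choices made for illustration, not prescribed by the assignment. The data are synthetic and the helper names are hypothetical.

```python
import numpy as np

def euclidean_distances(X):
    """Full n x n matrix of pairwise Euclidean distances between rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def classical_mds(D, n_components=2):
    """Classical (Torgerson) metric MDS from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negatives
    coords = eigvecs[:, order] * np.sqrt(lam)
    return coords, eigvals[order]

# Hypothetical stand-in data with columns on different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

D = euclidean_distances(Z)
coords, eigvals = classical_mds(D, n_components=2)
print(coords.shape)
```

The returned eigenvalues indicate how much of the distance structure each retained dimension captures, which helps when judging whether a two-dimensional map is adequate.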

C) [4 marks] Comparison to Cultivars

Does your analysis (A and B) support the existence of three distinct cultivars? Why or why not? Provide a critical interpretation. (Maximum: 200 words).
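One way to quantify agreement between cluster labels and the cultivar column, once the unsupervised analysis is finished, is a cross-tabulation together with the adjusted Rand index. The sketch below uses made-up labels purely to show the calculation. The function names are hypothetical.

```python
from math import comb
import numpy as np

def contingency(labels_a, labels_b):
    """Cross-tabulate two labelings; rows follow labels_a, columns labels_b."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)
    return table

def adjusted_rand_index(table):
    """Adjusted Rand index computed from a contingency table."""
    n = int(table.sum())
    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (maximum - expected)

# Made-up labels (not real results): cluster numbers need not match cultivar codes.
cultivars = [3, 3, 3, 1, 1, 1, 2, 2, 2]
clusters_perfect = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clusters_noisy = [0, 0, 1, 1, 1, 1, 2, 2, 2]

print(adjusted_rand_index(contingency(clusters_perfect, cultivars)))  # 1.0
print(contingency(clusters_noisy, cultivars))
```

The index is 1 for identical partitions and close to 0 for agreement at chance level, so it is unaffected by how the clusters happen to be numbered.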

2) [10 marks] Wine testing

As wine growers are interested in distinguishing between cultivars, they want to know whether it is possible to reliably discriminate between them based on their chemical composition. Carry out a Discriminant Analysis by following the steps below. For each question, justify your answers.

A) [3 marks] Canonical Discriminant Analysis

Split the dataset into a training and testing set. Perform an adequate Discriminant Analysis using all variables. Briefly interpret the canonical functions and the separation achieved between cultivars (maximum: 200 words).
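The following sketch shows one way to carry out a stratified train/test split and extract canonical discriminant functions by solving the generalised eigenproblem for the between-group and within-group scatter matrices. It runs on synthetic three-group data. The 30% test fraction, the seeds, and the helper names are assumptions made for illustration.

```python
import numpy as np

def stratified_split(y, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), sampling the same fraction in each class."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        n_test = int(round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)

def canonical_lda(X, y):
    """Eigenvalues and coefficient vectors of the canonical discriminant functions."""
    classes = np.unique(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))  # within-group scatter
    B = np.zeros((p, p))  # between-group scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        d = (mc - grand_mean)[:, None]
        B += len(Xc) * (d @ d.T)
    # Solve B a = lambda W a through the Cholesky factor of W (keeps it symmetric).
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    vals, vecs = np.linalg.eigh(L_inv @ B @ L_inv.T)
    order = np.argsort(vals)[::-1][: len(classes) - 1]
    return vals[order], L_inv.T @ vecs[:, order]

# Hypothetical stand-in data: three groups of 40 cases, four variables.
rng = np.random.default_rng(2)
means = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([m + rng.normal(size=(40, 4)) for m in means])
y = np.repeat([1, 2, 3], 40)

train_idx, test_idx = stratified_split(y, test_fraction=0.3)
X_train, y_train = X[train_idx], y[train_idx]
eigvals, coefs = canonical_lda(X_train, y_train)
scores = (X_train - X_train.mean(axis=0)) @ coefs  # canonical scores
print(eigvals / eigvals.sum())  # share of between-group separation per function
```

With three groups there are at most two canonical functions. The relative size of the eigenvalues shows how much of the separation each one carries, and a scatter plot of the two score columns shows how well the groups separate.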

B) [1 mark] Classification Accuracy

Apply the discriminant model to the testing subset. What is the overall classification error rate? Based on this result, would you trust the model’s performance? Briefly explain (maximum: 100 words).
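The overall error rate is the share of test cases that fall off the diagonal of the confusion matrix. The sketch below computes both from true and predicted labels. The nine labels are invented for illustration and are not results from the Wine data.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    table = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        table[index[t], index[p]] += 1
    return table

def error_rate(table):
    """Share of cases that are misclassified (off-diagonal)."""
    return 1.0 - np.trace(table) / table.sum()

# Invented labels for nine test cases (not real results).
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 3, 1, 3]

table = confusion_matrix(y_true, y_pred, classes=[1, 2, 3])
print(table)
print(round(error_rate(table), 3))  # 2 of 9 misclassified -> 0.222
```

With a small test set, a single misclassified case shifts the error rate noticeably, which is worth bearing in mind when judging how far to trust the figure.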

C)  [2 marks] Stepwise Discriminant Analysis

Conduct a Stepwise Discriminant Analysis on the training set. How many variables are selected, and which ones? Briefly describe the procedure (maximum: 100 words).
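A simplified sketch of the selection idea follows: forward selection that, at each step, adds the variable with the largest partial F statistic derived from Wilks' lambda. It is forward-only (a full stepwise procedure also tests variables for removal). The F-to-enter threshold of 3.84 is an assumed conventional value, and the data and function names are hypothetical.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-group (W) and total (T) sums-of-squares-and-cross-products."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc
    W = np.zeros_like(T)
    for c in np.unique(y):
        G = X[y == c] - X[y == c].mean(axis=0)
        W += G.T @ G
    return W, T

def wilks_lambda(W, T, cols):
    """Wilks' lambda det(W)/det(T) restricted to the columns in `cols`."""
    if not cols:
        return 1.0
    ix = np.ix_(cols, cols)
    return np.linalg.det(W[ix]) / np.linalg.det(T[ix])

def forward_stepwise(X, y, f_to_enter=3.84):
    """Add, one at a time, the variable with the largest partial F statistic."""
    n, p = X.shape
    g = len(np.unique(y))
    W, T = scatter_matrices(X, y)
    selected = []
    while len(selected) < p:
        q = len(selected)  # variables already in the model
        current = wilks_lambda(W, T, selected)
        best_f, best_j = -np.inf, None
        for j in range(p):
            if j in selected:
                continue
            ratio = wilks_lambda(W, T, selected + [j]) / current  # partial lambda
            f = (n - g - q) / (g - 1) * (1 - ratio) / ratio
            if f > best_f:
                best_f, best_j = f, j
        if best_f < f_to_enter:  # assumed threshold; no candidate is good enough
            break
        selected.append(best_j)
    return selected

# Hypothetical data: columns 0 and 1 carry group differences, the rest are noise.
rng = np.random.default_rng(3)
y = np.repeat([1, 2, 3], 50)
X = rng.normal(size=(150, 5))
X[:, 0] += np.repeat([0.0, 3.0, 6.0], 50)
X[:, 1] += np.repeat([0.0, 4.0, 0.0], 50)

selected = forward_stepwise(X, y)
print(selected)  # column indices in order of entry
```

Because selection is data-driven, it should be run on the training set only, so that the test-set error rate remains an honest estimate.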

D) [4 marks] DA with Selected Variables

Using only the subset of variables selected in (C), repeat the Discriminant Analysis on the training set and test it on the testing set. Compare the results of the full-variable and stepwise-variable analyses. Which model performs better in terms of error rate or interpretability? Which one would you recommend for future use? Briefly explain (maximum: 200 words).