CS39A0190 Data analytics basics

2024-01-27 CS39A0190 Data analytics basics

ASSIGNMENT 2 INSTRUCTIONS

CS39A0190 Data analytics basics


2024-01-15

”DATASETS”-file has three sheets. First sheet ”DATASET1” is its own task and sheets two

and three, ”DATASET2_1” and ”DATASET2_2” are another one.

The datasets needs to be cleaned before they can be analysed. You can turn the data into

table or keep as a range, whichever you find easier to work with.

DATASET1 basic info

The data is process data where Events go through a process from A to G.

Process step colums A to G have dates as data. The date depicts when the process step began. Process

step G means completion of the whole process.

DATASET2 basic info

The data is sales data for product range. SKU (stock keeping unit) is assigned to each unique product.

Agency depicts a retailer. Volume is the amount sold during the time and average max temp describes

the temperature at the retailer for the sales period.

DATASETS

Cleaning the data

Remove empty rows.

Remove incomplete Event records. Every process step must have a date.

Remove duplicate Event records. Each Event ID is unique.

If Excel does not recognize the process step data as dates, convert the cell contents into dates.

Event IDs and Event types are numbers by format.

Analysing the data

Calculate the duration (in days) between each process step and the total duration for all Events.

Calculate the average duration between each process step and the average total process duration for all

Event types.

Example for a single Event:

DATASET1 – 3 POINTS

First, use Text to Columns to get the comma separated data into separate columns.

Cleaning the data

Both DATASET2_1 and DATASET2_2 sheets have a column that uses dot as a decimal separator. If your

Excel has comma as a decimal separator, convert the dots to commas.

Check for duplicates, missing data, empty rows... and handle accordingly if found.

Analysing the data

Add the Avg_Max_Temp data from DATASET2_2 to DATASET_1. See example below.

Calculate monthly total sales (all SKUs) for each Agency.

Calculate annual total sales for each SKU.

DATASET2 – 3 POINTS

Data is from Eurostat, population data. Actual national population data from 2021 and

projections for 2022-2100.

Projections has multiple scenarios in separate sheets.

Your task is to combine the data for EU countries + Iceland, Norway and Switzerland (30

countries in total).

Open a blank Excel spreadsheet and connect to both of the Excel files to get data.

Clean the data, merge the datasets, and clean the merged data if necessary.

You will return results only for the 30 countries.

As a result, you will have for each country a 2021 actual population (total, males, and

females) and projections for 2022-2100 (total, males, and females).

From projections, use two scenarios. Baseline and a free choice for the second.

POWER QUERY – 6 POINTS

Find two datasets (from anywhere) that you can connect to and combine in Power Query.

Use two different sources for the data.

E.g., one from CSV and one from Web

Clean the data using Power Query and combine them into one dataset.

FIND AND MERGE – 3 POINTS

Return three Excel-files.

DATASETS

POWER QUERY

FIND AND MERGE

All used functions and Queries must remain in the submitted files. Being able to see the

data transformation process in the files is more important than the end result.

SUBMISSION