DTS002TC Essential of Big Data
Coursework 2 (Individual Assessment)
Due: 5:00 pm China time (UTC+8, Beijing) on Saturday, 24 May 2025
Weight: 50%
Maximum score: 100 marks (100% individual marks)
Assessed learning outcomes:
E. Demonstrate the ability to write code to obtain numerical solutions to mathematical problems.
F. Demonstrate the ability to display computational results in tabulated or graphical forms. Develop an understanding of the industrial and commercial applications of big data.
Late policy: 5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.
Risks:
• Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may result in loss of marks.
• Plagiarism results in a mark of ZERO.
• The formal procedure for submitting coursework at XJTLU is strictly followed. A submission link on Learning Mall will be provided in due course. The submission time stamp on Learning Mall will be used to check for late submission.
Overview
This coursework aims to provide students with hands-on experience in analyzing and predicting global electricity generation data using Python. Students will be required to perform data reading, preprocessing, prediction, visualization, and validation of their predictions against real-world data for selected countries. This exercise will help students understand the practical applications of big data analytics in the electricity sector and enhance their skills in data manipulation, visualization, and predictive modeling.
Task 1: Data Processing and Analysis (40 marks)
1.1 Data Reading and Preprocessing (15 marks)
Using Python, perform the following tasks:
a. Import the necessary libraries (e.g., pandas, numpy). (3 Marks)
b. Load the GlobalElectricityStatistics.csv dataset into a DataFrame named electricity_data. (3 Marks)
c. Display and check the first and last five rows of the DataFrame. (3 Marks)
d. Show the basic information of the DataFrame, including dimensions, column details, data types, and memory usage. (3 Marks)
e. Handle any missing values or inconsistencies in the data. (3 Marks)
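The steps in 1.1 can be sketched as follows. This is a minimal illustration rather than a model answer: the column names (`Country`, `Features`, `Region`), the `--` placeholder for missing values, and the tiny inline sample are all assumptions, since the layout of GlobalElectricityStatistics.csv is not described in this brief. With the real file, `pd.read_csv("GlobalElectricityStatistics.csv")` would replace the inline sample.

```python
import io

import pandas as pd

# Hypothetical stand-in for GlobalElectricityStatistics.csv (layout assumed).
sample_csv = io.StringIO(
    "Country,Features,Region,2019,2020,2021\n"
    "A,net generation,X,10.0,--,12.0\n"
    "B,net generation,Y,5.0,6.0,\n"
    "C,net generation,X,1.0,2.0,3.0\n"
)

electricity_data = pd.read_csv(sample_csv)

# First and last rows (this sample only has three rows in total).
print(electricity_data.head())
print(electricity_data.tail())

# Dimensions, column details, data types, and memory usage.
print(electricity_data.shape)
electricity_data.info(memory_usage="deep")

# Coerce the year columns to numbers; non-numeric placeholders become NaN.
year_cols = [c for c in electricity_data.columns if c.isdigit()]
electricity_data[year_cols] = electricity_data[year_cols].apply(
    pd.to_numeric, errors="coerce"
)

# One possible policy (an assumption, not a requirement): interpolate gaps
# along each row, i.e. across years, then fill anything left at the edges.
electricity_data[year_cols] = (
    electricity_data[year_cols].interpolate(axis=1).ffill(axis=1).bfill(axis=1)
)
print(electricity_data[year_cols].isna().sum().sum())
```

Interpolating across years is only one option; dropping incomplete rows or leaving gaps as NaN may be more appropriate depending on how many values are missing, and the choice should be justified in the report.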
1.2 Data Visualization (10 marks)
Visualize the electricity generation trends for five selected countries (e.g., China, United States, Ireland, South Africa, India) from 1980 to 2021. Using Python, perform the following tasks:
a. Plot line charts for each country showing the electricity net generation over the years. (5 Marks)
b. Use appropriate titles, labels, and legends to make the charts readable. (5 Marks)
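A line chart with titles, axis labels, and a legend might look like the sketch below. The country names and numbers are placeholders, not real statistics, and the unit in the y-axis label is an assumption that should be checked against the dataset. The `Agg` backend is selected only so the example runs without a display; in a notebook that line can be removed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove for on-screen display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values only: index = years, one column per country.
years = [2018, 2019, 2020, 2021]
net_generation = pd.DataFrame(
    {"Country A": [1.0, 1.2, 1.1, 1.4], "Country B": [0.5, 0.6, 0.6, 0.7]},
    index=years,
)

fig, ax = plt.subplots(figsize=(8, 4))
for country in net_generation.columns:
    # One line per country, labelled so the legend can identify it.
    ax.plot(net_generation.index, net_generation[country], marker="o", label=country)

ax.set_title("Electricity net generation by country")
ax.set_xlabel("Year")
ax.set_ylabel("Net generation (billion kWh, unit assumed)")
ax.legend(title="Country")
ax.grid(True, alpha=0.3)
fig.tight_layout()
```

Drawing all countries on one axis makes comparison easy, but countries of very different scale can flatten the smaller series; separate subplots or a logarithmic y-axis are reasonable alternatives.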
1.3 Data Aggregation and Summary (15 marks)
Using Python, perform the following tasks:
a. Calculate and display the average annual electricity generation for each country from 2000 to 2021. (5 marks)
b. Identify and display the country with the highest and lowest average electricity generation during this period. (5 marks)
c. Visualize the comparison of average electricity generation among the selected countries using a bar chart. (5 marks)
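The aggregation in 1.3 can be sketched with a small table of placeholder values (not real statistics). The sketch assumes the data has already been reshaped so that rows are countries and columns are integer years; the real dataset may need reshaping first.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove for on-screen display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values: rows are countries, columns are years.
generation = pd.DataFrame(
    {1999: [100.0, 100.0, 100.0], 2000: [10.0, 4.0, 1.0], 2001: [12.0, 6.0, 3.0]},
    index=["Country A", "Country B", "Country C"],
)

# Keep only the years in the requested window before averaging.
window = [year for year in generation.columns if 2000 <= year <= 2021]
average_generation = generation[window].mean(axis=1)

# Country labels with the highest and lowest average.
highest = average_generation.idxmax()
lowest = average_generation.idxmin()
print(average_generation)
print(f"Highest: {highest} ({average_generation[highest]:.2f})")
print(f"Lowest: {lowest} ({average_generation[lowest]:.2f})")

# Bar chart comparing the averages.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(average_generation.index, average_generation.values)
ax.set_title("Average annual net generation, 2000 to 2021")
ax.set_xlabel("Country")
ax.set_ylabel("Average net generation (unit assumed)")
fig.tight_layout()
```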
Task 2: Predictive Modeling and Discussion (60 marks)
"Net Consumption" refers to the total amount of electricity that is actually used by consumers within a specific area or country over a given period.
Net Consumption is calculated by considering the following components:
Net Generation: The total amount of electricity generated within the area, minus the electricity used by the power plants themselves (e.g., for plant operations).
Imports: The amount of electricity imported from other regions or countries.
Exports: The amount of electricity exported to other regions or countries.
Distribution Losses: The amount of electricity lost during transmission and distribution.
The formula for Net Consumption is:
Net Consumption = Net Generation + Imports − Exports − Distribution Losses
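As a quick check of the formula, the function below applies it to made-up inputs expressed in a common unit. The function and argument names are illustrative only.

```python
def net_consumption(net_generation, imports, exports, distribution_losses):
    """Net Consumption = Net Generation + Imports - Exports - Distribution Losses."""
    return net_generation + imports - exports - distribution_losses


# Made-up inputs, all in the same unit: 100 + 10 - 5 - 8
value = net_consumption(100.0, 10.0, 5.0, 8.0)
print(value)  # 97.0
```

The same expression works element-wise on pandas Series or NumPy arrays, provided the four components are aligned on the same countries and years.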
2.1 Data Preparation for Prediction (15 marks)
Using Python, perform the following tasks:
a. Calculate the Net Consumption value for each country from 1980 to 2021. (5 marks)
b. Select the country with the highest average Net Consumption among all countries from 1980 to 2021. (5 marks)
c. Split the Net Consumption data of the selected country into training and testing sets (e.g., 80% training, 20% testing). (5 marks)
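One way to perform the split in item c is chronologically, so that the model is trained on earlier years and tested on later ones. The sketch below uses a placeholder series in place of real Net Consumption values. A random split (for example with scikit-learn's `train_test_split`) would also satisfy the 80/20 requirement, but for yearly data it lets the model see years that come after some test points.

```python
import numpy as np

years = np.arange(1980, 2022)  # 42 yearly observations, 1980 to 2021
# Placeholder series (a straight line) standing in for real Net Consumption.
net_consumption = 100.0 + 5.0 * (years - 1980)

# First 80% of the years for training, the remainder for testing.
split = int(len(years) * 0.8)
train_years, test_years = years[:split], years[split:]
train_values, test_values = net_consumption[:split], net_consumption[split:]

print(f"train: {train_years[0]} to {train_years[-1]} ({len(train_years)} points)")
print(f"test:  {test_years[0]} to {test_years[-1]} ({len(test_years)} points)")
```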
2.2 Model Building and Prediction (15 marks)
Using Python, perform the following tasks:
a. Initialize a suitable predictive model with possible parameters (e.g., linear regression, Naive Bayes). (5 marks)
b. Train the model using the Net Consumption training data of the selected country. (5 marks)
c. Predict the Net Consumption for the years 2022 to 2024 for the selected country. (5 marks)
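The fit-then-forecast workflow in 2.2 can be sketched with an ordinary least-squares straight line fitted by `numpy.polyfit`, used here as a simple stand-in for a linear regression model (scikit-learn's `LinearRegression` would be an equivalent choice). The series is a placeholder that is exactly linear, so the result is easy to verify by hand; real data will not fit this cleanly.

```python
import numpy as np

years = np.arange(1980, 2022)
# Placeholder series: exactly linear, 100 in 1980 and rising by 5 per year.
values = 100.0 + 5.0 * (years - 1980)

# Chronological 80/20 split.
split = int(len(years) * 0.8)
x_train, y_train = years[:split], values[:split]
x_test, y_test = years[split:], values[split:]

# Centre the years so the least-squares problem stays well conditioned.
x0 = x_train.mean()
slope, intercept = np.polyfit(x_train - x0, y_train, deg=1)


def predict(year):
    """Evaluate the fitted line for a year or an array of years."""
    return slope * (np.asarray(year) - x0) + intercept


# Mean absolute error on the held-out years, then the 2022 to 2024 forecast.
test_mae = np.mean(np.abs(predict(x_test) - y_test))
future_years = np.array([2022, 2023, 2024])
forecast = predict(future_years)

print(f"test MAE: {test_mae:.4f}")
for year, value in zip(future_years, forecast):
    print(f"{year}: {value:.2f}")
```

A straight line extrapolates the historical trend and cannot capture shocks or saturation, which is worth keeping in mind when discussing discrepancies with actual values.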
2.3 Validation Against Real Data (15 marks)
a. Use internet resources to find the actual Net Consumption data for the selected country for the years 2022 to 2024 with Python. (5 marks)
b. Compare the predicted values with the actual values. Calculate the percentage error for each year with Python. (5 marks)
c. Discuss possible reasons for any discrepancies between the predicted and actual values within 200 words. (5 marks)
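The comparison in item b can be sketched as below. The numbers are placeholders chosen for easy arithmetic, not real or predicted statistics; actual values must come from a cited source. The sketch uses a signed error relative to the actual value, which is one common definition; an absolute percentage error is an equally valid choice if stated clearly.

```python
def percentage_error(predicted, actual):
    """Signed percentage error of a prediction relative to the actual value."""
    return (predicted - actual) / actual * 100.0


# Placeholder numbers only, keyed by year.
predicted = {2022: 110.0, 2023: 95.0, 2024: 100.0}
actual = {2022: 100.0, 2023: 100.0, 2024: 100.0}

errors = {
    year: percentage_error(predicted[year], actual[year]) for year in predicted
}
for year, err in errors.items():
    # A positive value means the model over-predicted that year.
    print(f"{year}: predicted={predicted[year]:.1f}, actual={actual[year]:.1f}, error={err:+.2f}%")
```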
2.4 Analysis and Conclusion (15 marks)
a. Summarize the findings from the predictive modeling and validation within 150 words. (5 marks)
b. Provide insights on how big data analytics can be applied to improve electricity generation planning and management within 150 words. (5 marks)
c. Provide insights on how big data analytics can be applied to improve other similar scenarios within 150 words. (5 marks)
Submission Format Instructions
The assignment must be typed, spell-checked, referenced, and submitted via Learning Mall Online to the correct dropbox.
Only electronic submissions are accepted - no hard copies:
• A Student_ID.pdf file containing a cover letter with your ID information and all the task report content.
All students must download their file and check that it is viewable after submission. Document uploads may become corrupted during the uploading process (e.g., due to slow internet connections). Students are therefore responsible for submitting a functional and correct file, and for testing it after submission.
Overall Marking Criteria
Code Quality and Implementation Results
Outstanding (100%): Code is exceptionally well-organized, readable, and well-commented. Implementation results are accurate and demonstrate a deep understanding of the concepts. All tasks are completed with high precision.
Appropriate (80%): Code is generally well-organized and readable. Implementation results are accurate and meet the requirements. Most tasks are completed effectively.
Needs Improvement (60%): Code is somewhat disorganized or poorly commented. Implementation results are mostly accurate but may have minor errors. Some tasks are incomplete or not fully addressed.
Hard to Understand (40%): Code is difficult to follow or lacks clarity. Implementation results are inaccurate or incomplete. Many tasks are not fully addressed or have significant errors.
No Submission or Missing Section (0%): No submission or critical sections of the assignment are missing.
Data Processing and Analysis (Task 1)
Outstanding (100%): Data reading, preprocessing, visualization, and aggregation are performed flawlessly. Results are presented clearly and accurately. All subtasks are completed with high precision.
Appropriate (80%): Data processing and analysis are generally well-executed. Results are mostly accurate and meet the requirements. Most subtasks are completed effectively.
Needs Improvement (60%): Data processing and analysis show some inaccuracies or inconsistencies. Results are partially accurate but may have minor errors. Some subtasks are incomplete.
Hard to Understand (40%): Data processing and analysis are poorly executed. Results are inaccurate or incomplete. Many subtasks are not fully addressed or have significant errors.
No Submission or Missing Section (0%): No submission or critical sections of the assignment are missing.
Predictive Modeling and Discussion (Task 2)
Outstanding (100%): Predictive modeling and validation are performed with high precision. Model choice is well-justified, training is accurate, and predictions are reliable. All subtasks are completed with high precision.
Appropriate (80%): Predictive modeling and validation are generally well-executed. Model choice is justified, training is accurate, and predictions are mostly reliable. Most subtasks are completed effectively.
Needs Improvement (60%): Predictive modeling and validation show some inaccuracies or inconsistencies. Model choice may not be fully justified, training may have minor errors, and predictions may be less reliable.
Hard to Understand (40%): Predictive modeling and validation are poorly executed. Model choice is unclear, training is inaccurate, and predictions are unreliable.
No Submission or Missing Section (0%): No submission or critical sections of the assignment are missing.
Analysis and Conclusion
Outstanding (100%): Analysis is thorough and insightful. Conclusions are well-supported by the data and results. Insights are relevant and demonstrate a deep understanding of the topic. Summary and discussion are concise and clear.
Appropriate (80%): Analysis is generally thorough. Conclusions are supported by the data and results. Insights are relevant and demonstrate a good understanding of the topic. Summary and discussion are clear.
Needs Improvement (60%): Analysis is somewhat superficial. Conclusions may lack full support from the data. Insights are partially relevant. Summary and discussion may lack clarity.
Hard to Understand (40%): Analysis is incomplete or unclear. Conclusions are not well-supported by the data. Insights are irrelevant or unclear. Summary and discussion are difficult to understand.
No Submission or Missing Section (0%): No submission or critical sections of the assignment are missing.