DSCI 510 – Spring 2025
Project Report 2 [ 55 points ]
Due: May 4, 2025 11:59pm PT.
Submission
Submit a link to your GitHub commit in the submission at https://brightspace.usc.edu. A link to a commit looks like this (example):
https://github.com/atregubov/dsci510hw2025/commit/f37245fa79ff8df0ab8f1d160055f66adc94775
Final Project GitHub Repo Guidelines
You have two options to store your project on GitHub:
1. Use your existing DSCI510 repository and create a directory “final_project” for your final project
2. Create a new repository and share it with us (use our GitHub IDs: SrinathBegudem and atregubov)
Your project repository should have the following directory structure:
● src/ - folder for your source code, python files
○ config.py - a configuration file, where you can store paths, urls to your data or APIs.
● docs/ - folder for your slides/presentation
○ docs/<first-name>-<last-name>.pdf - your final version of the slides
● data/ - a very small data sample for testing (e.g. files with no more than 20-50 lines). DO NOT put your actual data in the repository! Points will be deducted if you upload your actual data.
● .gitignore - you can use the template from here:
https://github.com/github/gitignore/blob/main/Python.gitignore
● requirements.txt - a list of the libraries you use; more on this here:
https://www.freecodecamp.org/news/python-requirementstxt-explained/
● README.md - a text file with your project description
● results.ipynb - a Jupyter notebook that runs your pipeline and analysis. Put all your Python code in Python files (.py) in the src/ folder and only call functions (e.g. to get data, draw charts) from the notebook.
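As a quick self-check before you submit, the directory structure above can be verified with a few lines of Python. This is only a sketch: the helper name missing_entries and the REQUIRED_ENTRIES list are illustrative and not part of the assignment, and the check only confirms that the entries exist, not that their contents are correct.

```python
from pathlib import Path

# Entries named in the required directory structure above.
REQUIRED_ENTRIES = [
    "src",
    "src/config.py",
    "docs",
    "data",
    ".gitignore",
    "requirements.txt",
    "README.md",
    "results.ipynb",
]


def missing_entries(repo_root="."):
    """Return the required files/folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [entry for entry in REQUIRED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # Run from the project root (or pass the path to "final_project").
    missing = missing_entries(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required files and folders are present.")
```

If you chose option 1, run the check from inside your “final_project” directory rather than from the repository root.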
Please add the following sections in your README.md:
● Introduction - describe your project
● Data sources - update and use the table from your project report 1 (skip the column “Have tried to access/collect data with python?”).
● Analysis - describe the type of analysis you do
● Summary of the results - you can leave this section empty if you don’t have results yet. Update it as you complete your project.
● How to run - describe how to run your pipeline and reproduce the results of your work/analysis, including fetching the data. We should be able to reproduce data loading/fetching and data processing. Describe which API keys we need and for which services. Don’t put your API keys here or anywhere else in the repository. This section may not be fully complete at this point, but you need to provide an initial draft.
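One common way to keep API keys out of the repository is to read them from an environment variable at run time. The sketch below assumes a variable named EXAMPLE_API_KEY and a helper named get_api_key; both are placeholders, so use whatever names fit your services and document them in the “How to run” section.

```python
import os


def get_api_key(env_var="EXAMPLE_API_KEY"):
    """Read an API key from the environment instead of hard-coding it.

    The variable name is a placeholder; the README should tell the
    grader which variables to set and for which service.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "See the 'How to run' section of README.md."
        )
    return key
```

With this approach the grader sets the variable in their own shell (for example, export EXAMPLE_API_KEY=... on macOS/Linux) and no key is ever committed.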
The main grading criterion is reproducibility. We should be able to follow the instructions in your README.md, re-run your code, and achieve the same or similar results (charts, plots, etc.) as you demonstrated in class.
If your project collects large amounts of data (e.g. you spend days or weeks collecting it), you should configure it to run on a smaller data sample, so that we can re-run it within 1-2 hours. Ask your friends to test your project before submission.
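One possible way to make the pipeline run on a smaller sample is a single limit setting that caps how many records the collection step returns. In the sketch below, SAMPLE_SIZE, fetch_records, and collect are all hypothetical names, and fetch_records is a stand-in for your real (slow) data source; the setting could live in src/config.py.

```python
import itertools

# Hypothetical setting: maximum number of records to collect.
# Set to None for a full run.
SAMPLE_SIZE = 100


def fetch_records():
    """Stand-in for a slow collection step (API calls, scraping, etc.)."""
    for i in range(1_000_000):
        yield {"id": i}


def collect(limit=SAMPLE_SIZE):
    """Collect at most `limit` records; None collects everything."""
    # islice stops pulling from the generator once the limit is reached,
    # so the remaining records are never fetched.
    return list(itertools.islice(fetch_records(), limit))
```

Because the source is a generator, a small limit also cuts the collection time, not just the amount of data kept.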