SD 212 Spring 2026 / Labs


Lab 6: Final project

1 Overview

Objective: Construct an automated data pipeline using bash and Python to merge federal spending data with an external dataset, analyze the combined data, and present a data-driven policy recommendation. All collaboration and submissions will be managed via a shared GitHub repository.

Timeframe: All class and lab time next week will be spent working on this project. There are specific milestones to reach every day, as detailed below, in order to reach the end goal of an excellent data-driven presentation in two weeks’ time.

You should expect to spend considerable time outside of class and lab in order to do well on this project. Think about all the time you spend working on labs and homeworks outside of class in a normal week of SD212. You should expect to spend a similar amount of time (or more) on this project.

Groups: You will be assigned to groups of 2 or 3 within your section. Everyone in the group is expected to contribute significantly, but not necessarily in the same way. The project grade is shared by all group members, except that 20% of your grade will be based on our teamwork rubric, which judges how well you individually helped the team meet its goals.

Ground rules: You can use the SD212 Gemini gem but no other AI tools. You may also help other groups, since every group should be working on a different topic with a different secondary dataset. Regardless of any help you receive, you will be expected to understand and explain your own work in detail. If in doubt, just ask your instructor — we want you to succeed!

Positive examples:

Here are two examples I found online of great data-driven presentations given by students:

  • “Region of Boom”: A team of 4 studied housing data, combining it with information about each city and historical trends, to build a tool that predicts future housing costs.

    This is definitely a bit more sophisticated than what you will be able to do as SD212 students (specifically some of the machine learning techniques they applied), but it’s an excellent, clear presentation, and required them to think about finding and combining multiple data sources.

  • “Cyclistic” case study by Camille Johnson: This was a team of 1 but is another great and effective data-driven presentation, using the kind of pandas and data cleaning tools that you should be familiar with as SD212 students. But note that, unlike for this project, this problem did not include coming up with the question to ask, or incorporating an outside data source.

2 Downloading and exploring the data

Your starting point will be usaspending.gov. This website gives anyone free access to information on nearly everything the U.S. federal government spends money on (except for classified expenditures), above a certain minimum threshold. So every contract, grant, or equipment purchase, from the Air Force to the National Park Service, is in there somewhere.

The dataset is massive, far larger than you could feasibly download in full. Instead, use the web interface to download a csv that contains just part of the data. The interface allows you to filter by agency, date range, and location. To get started, try getting just one week’s worth of data:

  • Go to https://www.usaspending.gov
  • Navigate to “Download the Data” and “Custom Award Data”
  • Select Prime Awards, select “All” agencies, “All” locations, and then ask for a “last modified date” within some one-week date range from 2025.
  • Click the “Download” button and wait patiently.
  • After a few minutes, your browser should download a .zip file that contains the csv(s) you requested.
  • While you wait, open another tab on usaspending.gov and navigate to “Find Resources” and “Data Dictionary”. This document will be crucial to understanding what you are looking at. The csv you are downloading has hundreds of columns, and this “data dictionary” tells you the meaning of each one.
  • When the download finishes, find that .zip file and unzip it to a temp directory where you can explore. The file you probably want to start with is named something like All_Contracts_PrimeTransactions_2026-04-08_H14M24S58_1.csv. I recommend renaming this file to something easier to type and moving it to a working folder so you can get started.

One week’s worth of spending will already be pretty massive, maybe 150,000 rows and 300 columns. This should be good for you to get started exploring. Later you will want to come back and probably select a wider date range but a more targeted selection of agency, location, or award type.
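With a file this wide, pandas can keep memory manageable by loading only the columns you care about via usecols. Here is a minimal sketch; the column names are illustrative guesses (check the real ones against the data dictionary), and a tiny in-memory stand-in replaces the actual download so the snippet runs on its own:

```python
import io
import pandas as pd

# Stand-in for the (much larger) usaspending.gov download; the real file has
# hundreds of columns, and these column names are illustrative guesses.
raw_csv = io.StringIO(
    "award_id_piid,awarding_agency_name,total_obligated_amount,recipient_state_code\n"
    "N0002426C0001,DEPT OF THE NAVY,1500000,MD\n"
    "W9124J26C0002,DEPT OF THE ARMY,250000,TX\n"
)

# usecols loads only the listed columns, which matters once the
# file has 150,000 rows and 300 columns.
wanted = ["awarding_agency_name", "total_obligated_amount"]
df = pd.read_csv(raw_csv, usecols=wanted)

print(df.shape)  # (rows, columns) -- only the requested columns were loaded
print(df["total_obligated_amount"].sum())
```

With the real file you would pass the csv path instead of the StringIO buffer; restricting to a handful of the 300 columns can cut load time and memory dramatically.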

Now start exploring this dataset! Your goal is to try to find something that might be interesting here, and to think of ideas for what topic you might want to explore for your project. I’m not going to tell you exactly how to do that, but here are some thoughts:

  • You might want to start a Jupyter notebook (.ipynb file), which is good for messing around without having to keep reloading the data, and for remembering what you’ve done when you find something cool.
  • On the command line, or in a bash cell in your jupyter notebook, you can try grep to look for something of interest across all the rows and columns. For instance, try searching for “NAVAL ACADEMY”.
  • From grep you’ll see the whole row with hundreds of columns, which is hard to decipher. So fire up Python and load your csv with pandas. Try to pick out one entry from the dataframe, perhaps using one of the id fields that appear at the beginning of each row.
  • Look at the entire series for that one row. There are hundreds of columns — you are trying to identify which one(s) might be interesting. Refer to the data dictionary from usaspending.gov if needed to understand what you are seeing.
  • Iterate this process, looking for different things in the data, trying to narrow in on a topic of interest that will be unique to what you are interested in investigating.
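The load-one-entry-and-inspect-it steps above might look like the sketch below. The column names and id value are invented for illustration, with a tiny in-memory csv standing in for the real download:

```python
import io
import pandas as pd

# Tiny stand-in for the real csv; real column names come from the data dictionary.
raw_csv = io.StringIO(
    "contract_award_unique_key,recipient_name,total_obligated_amount,naics_description\n"
    "CONT_AWD_001,UNITED STATES NAVAL ACADEMY,75000,EDUCATIONAL SERVICES\n"
    "CONT_AWD_002,ACME CORP,12000,\n"
)
df = pd.read_csv(raw_csv)

# Pull one row by its id field, then look at it as a Series: with hundreds of
# real columns, scanning one entry vertically beats a wide horizontal print.
row = df.loc[df["contract_award_unique_key"] == "CONT_AWD_001"].iloc[0]
print(row.dropna())  # drop empty columns to focus on what's actually filled in
```

In the real data, dropna() on a single row is especially handy: many of the hundreds of columns are blank for any given award, so dropping them leaves a much shorter, readable listing.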

When you start to narrow on a topic, start to think about your secondary data source that will be combined with what you are getting from usaspending.gov. Keep an open mind and be flexible here; you will probably need to explore a few different ideas before you find one that’s workable.

3 Deliverables & Milestones

3.1 Milestone 1: Repository Setup & Project Proposals

Due: 23:59 on Monday, April 13

  1. Repository Setup: Create a private GitHub repository. Add all team members and your instructor as collaborators.

  2. README.md: Initialize a README.md at the root of your repository. List all team members and state your chosen team name (which must exactly match your repository name).

  3. Registration: Complete this Google Form with your team name and the URL to your GitHub repository. (This is the only form you will submit for the project).

  4. Proposals: Submit N distinct project ideas, where N is the number of students in your group.

    • Create a directory named ideas/ at the root of your repo.
    • For each idea, create a markdown file named idea_1.md, idea_2.md, etc.
    • Each file must document: The exact subset of usaspending.gov data, the proposed external dataset, the specific columns in both datasets that will be of interest and used to combine them, and the proposed policy question/novel conclusion.

3.2 Milestone 2: Technical Proof of Concept & Final Selection

Due: 23:59 on Wednesday, April 15

Select your final project idea and prove the data can be successfully merged.

  1. Raw Data: Download your raw datasets manually and place them unmodified into a data/raw/ directory at the root of your repo.

    Important notes:

    • After Milestone 2, any changes to files in data/raw/ are strictly prohibited without explicit permission from your instructor. GitHub version history will be used to verify data immutability.

    • GitHub restricts standard file commits to 100MB, and free accounts have a 2GB total limit for Git Large File Storage (LFS). You will likely need to configure Git LFS to commit your raw data files.

  2. Proof of Concept Script: Create a directory named src/ at the root of your repo. Submit src/poc_merge.py. This script must join a sample (e.g., 1,000 rows) of the datasets and explicitly print:

    • The exact keys used for the merge.
    • The number of rows in the original USASpending dataset sample.
    • The number of rows in the original external dataset sample.
    • The number of rows in the final merged dataset.
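A minimal sketch of what src/poc_merge.py might look like, assuming a merge on state codes. The key names and the tiny hard-coded frames are placeholders for the 1,000-row samples you would actually read from data/raw/:

```python
import pandas as pd

# Hypothetical samples standing in for the real data; in poc_merge.py you
# would read these from data/raw/. Column names are illustrative only.
spending = pd.DataFrame({
    "recipient_state_code": ["MD", "VA", "TX", "MD"],
    "total_obligated_amount": [1_500_000, 250_000, 980_000, 40_000],
})
external = pd.DataFrame({
    "state": ["MD", "VA", "CA"],
    "median_income": [91_000, 87_000, 84_000],
})

# Inner join: only rows whose state appears in BOTH datasets survive.
merged = spending.merge(
    external, left_on="recipient_state_code", right_on="state", how="inner"
)

# The milestone asks for these four facts printed explicitly:
print("Merge keys: recipient_state_code <-> state")
print("USASpending sample rows:", len(spending))
print("External sample rows:   ", len(external))
print("Merged rows:            ", len(merged))
```

Printing all four counts makes it obvious when an inner merge silently drops rows (here the TX spending row has no match in the external data), which is exactly the kind of decision you will need to defend in the final presentation.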

3.3 Milestone 3: Pipeline and Analysis Completion

Due: 23:59 on Monday, April 20

Finalize your codebase and establish your thesis.

  1. Dependencies: Add a pixi.toml file to the root of your repository explicitly listing any Python library requirements needed to execute your code.

  2. Automated Pipeline: Submit a single pipeline.sh script at the repository root. Executing this bash script must run your entire workflow: reading from data/raw/, calling your Python cleaning/merging scripts in src/, and outputting the final structured summary tables to a new data/clean/ directory.

  3. Thesis: Add a new section to your main README.md stating the final conclusion or policy recommendation you will present based on these results. You should be able to state your thesis in just a few sentences.

3.4 Milestone 4: Visualizations & Slide Deck

Due: 23:59 on Wednesday, April 22

Create the slide deck for your final presentation.

  1. Visualizations: Must be generated programmatically from your data/clean/ outputs using Plotly. Include at least two distinct types of interactive charts (e.g., a choropleth map and a scatter/time-series plot).
  2. Slide Deck: Create your presentation using Google Slides. Ensure the viewing permissions allow your instructor to see it, and add the link to your slide deck to your main README.md. The slides must integrate your visualizations and directly support your policy thesis.

3.5 Milestone 5: Final Presentation

Due: During lab time on Thursday, April 23

Deliver a 10-minute briefing to the class and a judging panel.

  • Audience: Your judges will include your instructor and at least one outside data science expert who is unfamiliar with your specific topic or dataset. Tailor your presentation to an audience knowledgeable about data science but not your specific domain.

  • Content: Defend your technical data-cleaning decisions (handling NaN values, resolving misaligned keys, accounting for dropped rows). Present your visualizations and policy recommendations. Be prepared for technical and analytical Q&A.

4 Grading Breakdown

  • 20%: Milestones 1–4

    Evaluated based on timeliness, adherence to the strict file path/naming conventions, and meeting the technical requirements of the intermediate pipeline checks.

  • 20%: Teamwork and Collaboration

    Evaluated via peer assessment. Each student will complete a confidential evaluation of their group members based on the USNA CS Department Teamwork Rubric.

  • 60%: Final Product & Presentation

    Evaluated using criteria adapted from the UMD Info Challenge 2024 rubric. Each category is scored on a 1 to 10 scale, where 1 means “needs more work”, 5 means “minimally fits the assignment requirements”, and 10 means “exceptional; goes above and beyond the basic requirements”.

    Your project and presentation will be independently evaluated by your instructor and the outside expert who visits your section. Final interpretation of scores for assigning grades will be at the discretion of your instructor.

    Project Evaluation Criteria:

    • Results address the questions or problems posed by the Challenge Provider (the team’s thesis).
    • Project is well structured and organized.
    • Solutions are realistic.
    • Outcome is enhanced by additional information.
    • Visualizations help to clearly convey the team’s findings.
    • Students demonstrate knowledge of the project domain.

    Presentation Evaluation Criteria:

    • Completed within time limit.
    • Presented as a logical and engaging story.
    • Audience questions answered clearly and accurately.
    • All students participated.

    Overall Evaluation:

    • In addition to the criteria above, each judge will assign an Overall Evaluation Score (1-10).