SD 212 Spring 2026 / Labs


Lab 4: Credit card agreements

1 Overview

Today’s lab is going to look at a whole lot of credit card agreements. Yes, that is the long booklet of important-looking information that no one reads. So let’s get the computer to read them for us!

The data we will use comes from the Consumer Financial Protection Bureau, which apparently collects the card agreements for all credit cards issued by U.S. banks each quarter. Because these are PDF files instead of plain text, we will use Python’s pypdf library to read them.

1.1 Deadlines

  • Milestone: 2359 on Wednesday, 4 March
  • Complete lab: 2359 on Wednesday, 18 March

1.2 Learning goals

  • Gain experience in error handling from examining real-world data files which may be mis-formatted
  • Use a popular Python library to scrape text from PDF files
  • Use regular expressions to search for key phrases
  • Use Python libraries to create simple visualizations
  • Develop your own questions that can be investigated with data

2 Preliminaries

2.1 Markdown file to fill in

Unlike the last lab, we will not be using Jupyter notebooks for this one. Instead, you will write your code in plain old .py files, and put your answers to the questions in a .md file.

You should see a directory for this lab, along with the file lab04.md , if you run

git pull

in your sd212 directory.

The first two questions you need to fill in are kind of general and apply to the entire lab. They are both optional (except if you used any resources that need to be documented as per the course honor policy).

  1. What sources of help (if any) did you utilize to complete this lab? Please be specific.

  2. What did you think of the lab overall? Again, if you can be specific that is helpful!

2.2 Accessing the datasets

The full dataset available from the CFPB site is a little less than 1GB, which is once again a bit large for you each to download and store separately.

So again, we have downloaded the raw data for you into the following folder on ssh.cs.usna.edu, which is also mounted on all lab machines:

/home/mids/SD212/cc

Although this lab’s dataset has only a few thousand files, each of them is a PDF which takes longer to process, so it will be convenient to have some smaller datasets to get your code working with initially:

  • 1% of original size, about 36 PDFs: /home/mids/SD212/cc.01

  • 10% of original size, about 360 PDFs: /home/mids/SD212/cc.1

3 Python background

3.1 Writing files in Python

We have usually been just reading files in Python programs, but Python can also be used to create new text files pretty easily.

The syntax is like this:

filename = 'somefile.txt'

with open(filename, 'w') as fout:
    print('First line of my file', file=fout)
    n = 2
    print('line', n, 'of my file', file=fout)

Notice three crucial differences from opening a file for reading:

  • We usually put the opening in a with block. That ensures that if your program crashes, the data written so far will actually be saved to the file. Very useful!
  • You pass the 'w' argument to the open() function so that the file is opened in “writing” mode.
  • To add a line of output to the file, use the print() function as normal, but with an extra named parameter file= for the name of your opened file handle.
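
One pitfall worth illustrating: 'w' mode empties the file every time it is opened, so when you write results from inside a loop it is usually best to open the file once, outside the loop. Here is a minimal sketch of that pattern; the word list and the filename are made up for illustration, and a scratch directory is used so that nothing real gets overwritten.

```python
import tempfile
from pathlib import Path

# Hypothetical data and filename, just for illustration.
words = ['alpha', 'beta', 'gamma']
outdir = Path(tempfile.mkdtemp())   # scratch directory
outname = outdir / 'words.txt'

# Open the output file ONCE, then write to it from inside the loop.
with open(outname, 'w') as fout:
    for w in words:
        print(w, file=fout)

# Read the file back to confirm that every line was saved.
with open(outname) as fin:
    lines_written = [line.strip() for line in fin]
print(lines_written)
```

If the open() call were moved inside the for loop instead, each pass would start the file over and only the last word would survive.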

3.2 Directories and files with pathlib

We have spent a lot of time writing bash commands to crawl through directories, but not so much with Python.

To go through directories and sub-directories in Python, find out whether files exist, etc., we can use the pathlib module.

Normally we represent file and directory names in Python using a normal string, like myfolder/myfile.txt. With pathlib, we can instead use the Path class to represent file and directory names, and then we get some convenient methods:

  • Path(str): Turn a string into a Path object
  • str(p): Turn a Path object back into a regular string
  • p.iterdir(): Loop over the files (and subdirectories) of a given directory path p
  • p.is_dir(): Returns true or false depending on whether p is a directory
  • p.is_file(): Returns true or false depending on whether p is a normal file
  • p.open(): Opens the file named by the Path object p. (It’s the same as doing open(str(p)).)

For a complete example, here is a Python program which is equivalent to this one-line bash command:

wc -l books/*.txt

The Python version uses pathlib to go through a directory books, open every .txt file in that directory, and print out how many lines that file has.

from pathlib import Path

booksdir = Path('books')
for filepath in booksdir.iterdir():
    # filepath represents a single file or subfolder inside books/
    if str(filepath).endswith('.txt'):
        # count lines in the file
        with filepath.open() as handle:
            count = 0
            for line in handle:
                count += 1
        # print out the filename and number of lines
        print(filepath, count)
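
The example above only looks at one level of the books directory. When files are spread across sub-directories, one possible approach is a function that calls itself on each sub-directory it meets. The sketch below is only an illustration: find_files is a hypothetical helper name, the directory tree is made up, and it uses the .name attribute of a Path (the last piece of the path, as a string) in addition to the methods listed earlier.

```python
import tempfile
from pathlib import Path

def find_files(folder, ending):
    """Return a list of every file under folder (at any depth) whose name ends with ending."""
    found = []
    for entry in folder.iterdir():
        if entry.is_dir():
            # recurse into the subdirectory and collect whatever it contains
            found.extend(find_files(entry, ending))
        elif entry.is_file() and entry.name.endswith(ending):
            found.append(entry)
    return found

# Build a tiny made-up directory tree to try it on.
root = Path(tempfile.mkdtemp())
(root / 'sub').mkdir()
(root / 'a.txt').write_text('one\n')
(root / 'sub' / 'b.txt').write_text('two\n')
(root / 'sub' / 'c.dat').write_text('three\n')

names = sorted(p.name for p in find_files(root, '.txt'))
print(names)
```

pathlib also has a built-in p.rglob(pattern) method that searches recursively, which may be shorter if you prefer it.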

3.3 Scraping PDF files with pypdf

PDF files are binary files (not plain-text), so we can’t read through them and use string-processing tools like we are used to.

Instead, we can use the pypdf library to extract the text from a PDF file, and then use our normal Python skills to work with that text.

Now let’s see an example of using pypdf to extract the text from MIDREGS. Download the MIDREGS pdf and save it as midregs.pdf from the command line using wget:

wget -O midregs.pdf 'https://www.usna.edu/Commandant/_files/COMDTMIDNINST_5400.7_MIDSHIPMEN_REGULATIONS_MANUAL.pdf'

Then the following Python program will extract the text from each page and tell us which pages contain the word “fun”:

from pypdf import PdfReader

# start by opening the file and creating a PdfReader object
rdr = PdfReader('midregs.pdf')

# go through each page and look for fun
pagenum = 1
for page in rdr.pages:
    # get a regular Python string for all the text on this page of the pdf
    text = page.extract_text()
    if 'fun' in text:
        print(pagenum)
    pagenum += 1
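
Some real-world PDF files are mis-formatted, and trying to read one can raise an exception that stops the whole program. A common way to cope is to wrap the risky step in try/except and move on to the next file. The sketch below shows only that pattern: total_length and fake_extract are hypothetical names, and fake_extract is a stand-in for a real PDF text extractor so that the example runs without any PDF files.

```python
def total_length(filenames, extract):
    """Add up len(extract(name)) over all filenames, skipping any file that raises an error."""
    total = 0
    skipped = []
    for name in filenames:
        try:
            text = extract(name)
        except Exception:
            # something was wrong with this file: remember it and move on
            skipped.append(name)
            continue
        total += len(text)
    return total, skipped

# Stand-in for a PDF text extractor (hypothetical): it fails on "bad" files.
def fake_extract(name):
    if name.startswith('bad'):
        raise ValueError('cannot read ' + name)
    return 'some text'

total, skipped = total_length(['good1.pdf', 'bad.pdf', 'good2.pdf'], fake_extract)
print(total, skipped)
```

With pypdf, the extractor would be a small function that creates a PdfReader and collects page.extract_text() over rdr.pages. Catching Exception is deliberately broad here; the important part is that both creating the reader and extracting the text happen inside the try block.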

Part 1: Not-so-hidden fees (30 pts)

To get started, write a Python program fees.py that goes through the PDFs in the 1% dataset cc.01 and counts how many times the word fee or fees appears, in total.

Here’s one way to tackle this:

  • Start with a single PDF file, maybe

    cc.01/FIRST NATIONAL BANK/FNCLR-D-0922(AFTI) - FNCLR-D-0922 (AFTI).pdf

  • Copy the pypdf example above and modify it to work with this file. Run it! In this file, the string “fun” appears on pages 2 and 5.

  • Modify your program so that it uses a regex and counts instances of the string “fee” instead of “fun”. (Look back in your notes for how to use the re library and the findall() method.)

    This file has 22 occurrences of the string “fee”.

    (Note: the page numbers don’t actually matter anymore at this point, so you should be able to simplify your code!)

  • Tweak your regex so that it only matches the entire word “fee” or “fees”, ignoring case.

    There should be 34 occurrences in this example file.

    (Hint: look at the re module documentation to see how you can tell Python to ignore case in a call to findall().)

  • Now loop over all the pdf files in all subdirectories of cc.01, using the pathlib module as in the example above. Get your fees.py program to print out the total count at the end.

    Note: some of the pdf files are improperly formatted and will give an error message when you try to open them with the PdfReader. Use proper error handling so that your program just ignores such files and moves on to the next one.

    Another note: Extracting text from PDF files is kind of slow! Even with the 1% dataset cc.01, it might take around 1 minute to successfully find the total count.
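
To illustrate the difference between matching a string anywhere and matching a whole word while ignoring case, here is a small sketch on made-up text. The word "rate" is only a stand-in for whatever word you care about; the pattern is one possible way to write it, not the only one.

```python
import re

# Made-up sample text; the word "rate" stands in for whatever word you care about.
text = 'Rates may change. The penalty rate applies. Corporate rates vary; see RATE table.'

# \b marks a word boundary, s? makes the trailing "s" optional,
# and re.IGNORECASE makes the match case-insensitive.
matches = re.findall(r'\brates?\b', text, flags=re.IGNORECASE)
print(matches)
print(len(matches))
```

Notice that "Corporate" is not counted even though it contains the letters "rate", because there is no word boundary in front of them.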

Now fill in this question in the markdown file:

  1. What is the total number of times the word “fee” or “fees” appears in the 1% dataset cc.01?

3.4 Submit what you have so far

Save your files and submit your work so far:

submit -c=sd212 -p=lab04 lab04.md fees.py

or

club -csd212 -plab04 lab04.md fees.py

or use the web interface

3.5 Milestone

For this lab, the milestone means everything up to this point, which includes the following auto-tests:

part1

This milestone is not the half-way point. Keep going!

Part 2: State of the (credit) union (30 pts)

Copy your fees.py program to a new program states.py for this part.

We want to analyze the locations of the banks that are issuing all of these credit cards. Most of the PDFs contain one or more mailing addresses, presumably corresponding to where the bank is located, so make a small change to your program so that it finds the states mentioned in those addresses.

Your states.py program should:

  • Use a regular expression to look through the text of each PDF page for something that looks like a state abbreviation (two capital letters), followed by a single space, followed by a 5-digit zip code.

    Be sure not to include anything else; for example PO BOX 89909 should not count as a state OX, and NMLSR ID 399801 should not count for Idaho since that’s a 6-digit code.

  • Print all the state abbreviations that you find, one per line, with repeats, to a new text file cardstates.txt as you go.
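
As a sketch of how such a pattern might be put together, here is one possible regular expression tried on made-up text that includes the tricky cases mentioned above. Treat it as an assumption to test against real data rather than the definitive answer.

```python
import re

# One possible pattern (an assumption, not the only answer):
#   \b          word boundary, so the two letters are not the tail of a longer word
#   ([A-Z]{2})  exactly two capital letters, captured as a group
#   (a space)   a single space
#   \d{5}       exactly five digits
#   \b          word boundary, so a sixth digit cannot follow
pattern = re.compile(r'\b([A-Z]{2}) \d{5}\b')

# Made-up sample text containing the tricky cases from the lab description.
text = ('Mail to PO BOX 89909, Omaha, NE 68103-2557. '
        'NMLSR ID 399801. Or write to Portland, OR 97201.')

# With one capture group, findall() returns just the captured part of each match.
states = pattern.findall(text)
print(states)
```

Because the pattern has a single capture group, each item in the result is already just the two-letter abbreviation, ready to be printed to a file one per line.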

After running your states.py program, you should have a cardstates.txt file. Now count how many times Oregon (OR) appears in it; a single bash command can answer this. For the 1% dataset, you should get 14. Then run your program on the 10% dataset.

Now fill in this question in your markdown file:

  1. For the 10% dataset, how many times does the state OR appear?

3.6 Submit what you have so far

Save your files and submit your work so far:

submit -c=sd212 -p=lab04 lab04.md fees.py states.py cardstates.txt

or

club -csd212 -plab04 lab04.md fees.py states.py cardstates.txt

or use the web interface

Part 3: Population control (30 pts)

We want to compare the prevalence of each state in the credit card agreements with the population of that state. For that, we first need another piece of data, the population of each state.

Bottom line: The goal of this part is to write a program tally.py that produces (and prints) a Pandas DataFrame with at least three columns: the state abbreviation, state population, and count of how many times that state was mentioned in the credit card agreement PDFs.

Below I have some steps and suggestions of how to make this DataFrame, but you are encouraged to stop now and try to figure it out your own way!

3.7 Population Data

We have provided you with a file pops.csv in your folder for this lab. This was created for you from Census data.

Important: You need the state populations to be listed by abbreviation, to match up with the cardstates.txt file you already have. This CSV has a column with the state abbreviations.

3.8 Tally up

Create a new Python program tally.py. This program should read the big list of state abbreviations in your cardstates.txt from the previous part, and turn it into a Pandas DataFrame that has two columns, one for the state abbreviation and one for the count of how many times that state appeared in the CC agreements.

But this is a .txt file, not a .csv file. If you read it with pandas as a CSV, there is no header line giving the column names, so by default pandas will treat the first row as the header. Read the following tips on how to handle this.

  • read_csv:

    You know this function well, but since cardstates.txt is just a list of state names, it’s like a single-column csv file. There is no header name for the single-column. You can specify an option like names=['state'] to tell Pandas there is no header and to give it a nice name:

    df = pd.read_csv('cardstates.txt', names=['state'])
  • value_counts

    Call this function on a Pandas series to combine entries with the same name and get how many times each label appears in that series.

    What gets returned is a new, smaller series, which is indexed by the original series values, and whose entries are integers counting how many times each value occurs.

  • reset_index

    This one is new. What it does is take a single-column Pandas Series and turn it into a two-column DataFrame, where the first column is the index from the series.

    Using reset_index after a call to value_counts can be especially useful!

Your goal here is to get a Pandas DataFrame with a column for the state name and a column for how many times that state appeared in cardstates.txt.
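
The three tips above can be chained together. Here is a minimal sketch on made-up data, using io.StringIO as a stand-in for the real cardstates.txt file. The column name count is an assumption chosen for illustration, and rename_axis() is an extra call used here so that the column names come out the same in different pandas versions.

```python
import io
import pandas as pd

# Made-up stand-in for cardstates.txt: one abbreviation per line, no header.
fake_file = io.StringIO('SD\nDE\nSD\nOR\nSD\nDE\n')

# names=['state'] tells pandas there is no header row and names the single column.
df = pd.read_csv(fake_file, names=['state'])

# value_counts() gives a Series indexed by state; reset_index() turns it into a DataFrame.
# rename_axis() and name= pin down the two column names explicitly.
counts = df['state'].value_counts().rename_axis('state').reset_index(name='count')
print(counts)
```

The result has one row per distinct state, sorted from most to least frequent.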

3.9 Combine

Now it’s time to add the logic to your tally.py program so that it combines the state population data with the counts from the credit card agreements.

As usual, there are many possible ways to do this! Here’s one way:

  • At this point you should have a DataFrame with state abbreviations and populations, and a completely separate DataFrame that just has the state names and counts.

  • We want to combine these and “match up” the rows that correspond to the same state. The first thing to do is to get the columns for the state abbreviations to have the same column name in both DataFrames.

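    You will need to use the rename method to change your column names so the state columns match. It’s easy, but note that rename returns a new DataFrame rather than changing df in place, so assign the result:

    df = df.rename(columns={'Old Name': 'new name'})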
  • Next we want to use the Pandas merge operation.

    This is a really sophisticated function (which we will learn more about later in the semester), so here is the key idea for this lab: merge the two DataFrames on the shared state column, keeping every row of the population DataFrame even when it has no match in the counts.

  • Make sure you do a .fillna(0) to change the NaNs for states that didn’t have any credit cards, into zero counts as they should be.
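
As a small illustration of merge followed by fillna(0), here is a sketch on toy data. The numbers are made up, and the column names state, population, and count are assumptions; your real DataFrames may use different names.

```python
import pandas as pd

# Toy data; the column names and numbers here are assumptions for illustration.
pops = pd.DataFrame({'state': ['DE', 'SD', 'WY'],
                     'population': [1000, 900, 600]})
counts = pd.DataFrame({'state': ['SD', 'DE'],
                       'count': [3, 2]})

# how='left' keeps every row of pops, even states with no match in counts.
merged = pd.merge(pops, counts, on='state', how='left')

# WY had no row in counts, so its count is NaN after the merge; make it 0.
merged['count'] = merged['count'].fillna(0)
print(merged)
```

Without how='left', merge keeps only the states that appear in both DataFrames, so a state with no credit card agreements would silently disappear instead of showing up with a count of zero.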

3.10 Question

Use your DataFrame to do a very simple analysis and answer this question:

  1. For the 100% data, which state has the highest ratio of credit card agreements per population? Write just the 2-letter abbreviation of the state.
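
One possible way to do this kind of analysis is to add a ratio column and then ask for the row where it is largest. The sketch below uses made-up state codes and numbers, and the column names are assumptions for illustration.

```python
import pandas as pd

# Toy numbers, made up for illustration only.
df = pd.DataFrame({'state': ['AA', 'BB', 'CC'],
                   'population': [1000, 200, 500],
                   'count': [10, 8, 5]})

# New column: mentions per person.
df['ratio'] = df['count'] / df['population']

# idxmax() gives the row label of the largest ratio; .loc looks up that row's state.
top_state = df.loc[df['ratio'].idxmax(), 'state']
print(top_state)
```

Sorting with df.sort_values('ratio') is another option if you want to see the whole ranking rather than just the top state.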

3.11 Submit what you have so far

Save your files and submit your work so far:

submit -c=sd212 -p=lab04 lab04.md fees.py states.py cardstates.txt tally.py

or

club -csd212 -plab04 lab04.md fees.py states.py cardstates.txt tally.py

or use the web interface

Part 4: Graph it (10 pts)

Modify your tally.py program so that it displays a scatterplot with population on the x-axis and the credit card agreement count on the y-axis. Each state should appear as a single dot, so (for example) a state on the bottom-right would have a high population and a low count of credit card agreements.

Making the scatterplot using Plotly Express should be very simple if you have your DataFrame worked out from the last part. The second example on this page shows how to make a scatterplot from a dataframe and is a great starting point. A more complete description of the px.scatter() function is here.

Add a trendline to your scatterplot (look at the documentation pages above to find the right option). The trendline will show sort of the “average” relationship between state population and credit card agreements.

Once you are happy with how your graph looks, save it to a file called scatter.png. Then answer two final questions:

  1. What are the most significant outliers in the dataset? In your graph, these would be the states that are farthest from the trendline. What states are way above or below the “typical” population-scaled average?

  2. Choosing one outlier state you identified from the previous problem, try to do some quick research to make a plausible explanation of why that state has so many or so few credit card companies relative to its population.

    (For example, some states have different tax laws, lending regulations, or court systems that may make them more or less attractive to the credit card companies.)

3.12 Submit your work

Save your files and submit your work:

submit -c=sd212 -p=lab04 lab04.md fees.py states.py cardstates.txt tally.py scatter.png

or

club -csd212 -plab04 lab04.md fees.py states.py cardstates.txt tally.py scatter.png

or use the web interface