Lab 4: Credit card agreements
1 Overview

Today’s lab is going to look at a whole lot of credit card agreements. Yes, that is the long booklet of important-looking information that no one reads. So let’s get the computer to read them for us!
The data we will use comes from the Consumer Financial Protection Bureau, which apparently collects the card agreements for all credit cards issued by U.S. banks each quarter. Because these are PDF files instead of plain text, we will use Python’s pypdf library to read them.
1.1 Deadlines
- Milestone: 2359 on Wednesday, 4 March
- Complete lab: 2359 on Wednesday, 18 March
1.2 Learning goals
- Gain experience in error handling from examining real-world data files which may be mis-formatted
- Use a popular Python library to scrape text from PDF files
- Use regular expressions to search for key phrases
- Use Python libraries to create simple visualizations
- Develop your own questions that can be investigated with data
2 Preliminaries
2.1 Markdown file to fill in
Unlike the last lab, we will not be using Jupyter notebooks for
this one. Instead, you will write your code in plain old .py files,
and put your answers to the questions in a .md file.
You should see a directory for this lab, along with the file lab04.md, if you run
git pull
in your sd212 directory.
The first two questions you need to fill in are kind of general and apply to the entire lab. They are both optional (except if you used any resources that need to be documented as per the course honor policy).
What sources of help (if any) did you utilize to complete this lab? Please be specific.
What did you think of the lab overall? Again, if you can be specific that is helpful!
2.2 Accessing the datasets
The full dataset available from the CFPB site is a little less than 1GB, which is once again a bit large for you each to download and store separately.
So again, we have downloaded the raw data for you in the following
folder on ssh.cs.usna.edu and mounted on all lab machines:
/home/mids/SD212/cc
Although this lab’s dataset has only a few thousand files, each of them is a PDF which takes longer to process, so it will be convenient to have some smaller datasets to get your code working with initially:
1% of original size, about 36 PDFs:
/home/mids/SD212/cc.01
10% of original size, about 360 PDFs:
/home/mids/SD212/cc.1
3 Python background
3.1 Writing files in Python
We have usually been just reading files in Python programs, but Python can also be used to create new text files pretty easily.
The syntax is like this:
filename = 'somefile.txt'
with open(filename, 'w') as fout:
    print('First line of my file', file=fout)
    n = 2
    print('line', n, 'of my file', file=fout)
Notice three crucial differences from opening a file for reading:
- We usually put the opening in a with block. That ensures that if your program crashes, the data written so far will actually be saved to the file. Very useful!
- You pass the 'w' argument to the open() function so that the file is opened in “writing” mode.
- To add a line of output to the file, use the print() function as normal, but with an extra named parameter file= for the name of your opened file handle.
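To convince yourself that the file really gets written, here is a minimal sketch that writes a few lines and then reads them back. It works inside a temporary directory (an assumption made only so the sketch does not clutter your lab folder); the file name is just for illustration.

```python
import tempfile
from pathlib import Path

# A throwaway directory so the sketch does not touch your real files.
with tempfile.TemporaryDirectory() as tmpdir:
    filename = Path(tmpdir) / 'somefile.txt'   # illustrative file name

    # 'w' creates the file, or empties it if it already exists.
    with open(filename, 'w') as fout:
        print('First line of my file', file=fout)
        for n in range(2, 4):
            print('line', n, 'of my file', file=fout)

    # Re-open in the default reading mode to check what was written.
    with open(filename) as fin:
        lines = fin.read().splitlines()

print(lines)
```

Keep in mind that 'w' starts the file over every time it is opened. If you want to add to the end of an existing file instead, open it with 'a' (append mode).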
3.2 Navigating directories in Python with pathlib
We have spent a lot of time writing bash commands to crawl through directories, but not so much with Python.
To go through directories and sub-directories in Python, find out whether files exist, etc., we can use the pathlib module.
Normally we represent file and directory names in Python using a normal
string, like myfolder/myfile.txt. With pathlib, we can instead use
the Path class to represent file and directory names, and then we get
some convenient methods:
- Path(str): Turn a string into a Path object
- str(p): Turn a Path object back into a regular string
- p.iterdir(): Loop over the files (and subdirectories) of a given directory path p
- p.is_dir(): Returns true or false depending on whether p is a directory
- p.is_file(): Returns true or false depending on whether p is a normal file
- p.open(): Opens the file named by the Path object p. (It’s the same as doing open(str(p)).)
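Here is a small sketch of these methods working together to look one level down into subdirectories. The folder layout is made up and built in a temporary directory; mkdir(), write_text(), and the / operator for joining paths are additional pathlib features beyond the list above, used here only to set up the toy folder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)                       # Path(str): string -> Path

    # Build a tiny made-up folder: one file and one subfolder with a file.
    (root / 'sub').mkdir()
    (root / 'a.txt').write_text('one\ntwo\n')
    (root / 'sub' / 'b.txt').write_text('three\n')

    found = []
    for p in sorted(root.iterdir()):          # entries directly inside root
        if p.is_dir():
            # p is a subfolder, so loop over what is inside it
            for q in sorted(p.iterdir()):
                if q.is_file():
                    found.append(q.name)
        elif p.is_file():
            found.append(p.name)

print(found)
```

iterdir() makes no promise about the order of the entries, which is why the sketch sorts them before looping.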
For a complete example, here is a Python program which is equivalent to this one-line bash command:
wc -l books/*.txt
The Python version uses pathlib to go through a directory books, open every .txt file in that directory, and print out how many lines that file has:
from pathlib import Path
booksdir = Path('books')
for filepath in booksdir.iterdir():
    # filepath represents a single file or subfolder inside books/
    if str(filepath).endswith('.txt'):
        # count lines in the file; the with block closes it afterwards
        with filepath.open() as handle:
            count = 0
            for line in handle:
                count += 1
        # print out the filename and number of lines
        print(filepath, count)
3.3 Scraping PDF files with pypdf
PDF files are binary files (not plain-text), so we can’t read through them and use string-processing tools like we are used to.
Instead, we can use the pypdf library to extract the text from a PDF file, and then use our normal Python skills to work with that text.
Now let’s see an example of using pypdf to extract the text from
MIDREGS. Download the MIDREGS pdf and save it as
midregs.pdf from the command line using wget:
wget -O midregs.pdf 'https://www.usna.edu/Commandant/_files/COMDTMIDNINST_5400.7_MIDSHIPMEN_REGULATIONS_MANUAL.pdf'
Then the following Python program will extract the text from each page and tell us which pages contain the word “fun”:
from pypdf import PdfReader
# start by opening the file and creating a PdfReader object
rdr = PdfReader('midregs.pdf')
# go through each page and look for fun
pagenum = 1
for page in rdr.pages:
    # get a regular Python string for all the text on this page of the pdf
    text = page.extract_text()
    if 'fun' in text:
        print(pagenum)
    pagenum += 1
Part 1: Not-so hidden fees (30 pts)
To get started, write a Python program fees.py that goes through the PDFs in the 1% dataset cc.01 and counts how many times the word fee or fees appears, in total.
Here’s one way to tackle this:
- Start with a single PDF file, maybe “cc.01/FIRST NATIONAL BANK/FNCLR-D-0922(AFTI) - FNCLR-D-0922 (AFTI).pdf”
- Copy the pypdf example above and modify it to work with this file. Run it! In this file, the string “fun” appears on pages 2 and 5.
- Modify your program so that it uses a regex and counts instances of the string “fee” instead of “fun”. (Look back in your notes for how to use the re library and the findall() method.)
  This file has 22 occurrences of the string “fee”.
  (Note: the page numbers don’t actually matter anymore at this point, so you should be able to simplify your code!)
- Tweak your regex so that it only matches the entire word “fee” or “fees”, ignoring case.
  There should be 34 occurrences in this example file.
  (Hint: look at the re module documentation to see how you can tell Python to ignore case in a call to findall().)
- Now loop over all the pdf files in all subdirectories of cc.01, using the pathlib module as in the example above. Get your fees.py program to print out the total count at the end.
  Note: some of the pdf files are improperly formatted and will give an error message when you try to open them with the PdfReader. Use proper error handling so that your program just ignores such files and moves on to the next one.
  Another note: Extracting text from PDF files is kind of slow! Even with the 1% dataset cc.01, it might take around 1 minute to successfully find the total count.
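The last two steps (whole-word matching and skipping bad files) can be sketched on a different word, so the fee regex is still yours to write. This is only a sketch under stated assumptions: read_document() is a made-up stand-in for opening a PDF and extracting its text, and the "files" are fake strings. In your real program the code inside the try block would be your pypdf calls.

```python
import re

# Whole word "rate" or "rates", in any capitalization.
RATE_WORD = re.compile(r'\brates?\b', re.IGNORECASE)


def read_document(name):
    """Hypothetical stand-in for opening a PDF and extracting its text."""
    fake_files = {
        'good1': 'Rates may change. The penalty rate applies.',
        'good2': 'Separate or accelerate? Neither is the word RATE.',
    }
    if name not in fake_files:
        # Simulates a file that cannot be parsed.
        raise ValueError(f'cannot parse {name}')
    return fake_files[name]


total = 0
skipped = []
for name in ['good1', 'broken', 'good2']:
    try:
        text = read_document(name)
    except Exception:
        # Unreadable input: remember it and move on to the next one.
        skipped.append(name)
        continue
    total += len(RATE_WORD.findall(text))

print(total, skipped)
```

Catching Exception is the broadest reasonable net. Once you have seen which error types the bad PDFs actually raise on your machine, you may prefer to list those types explicitly so that genuine bugs in your own code are not silently skipped.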
Now fill in this question in the markdown file:
- What is the total number of times the word “fee” or “fees” appears in the 1% dataset cc.01?
3.4 Submit what you have so far
Save your files and submit your work so far:
submit -c=sd212 -p=lab04 lab04.md fees.py
or
club -csd212 -plab04 lab04.md fees.py
or use the web interface
3.5 Milestone
For this lab, the milestone means everything up to this point, which includes the following auto-tests:
part1
This milestone is not the half-way point. Keep going!
Part 2: State of the (credit) union (30 pts)
Copy your fees.py program to a new program states.py for this part.
Make a small change so that it finds the states that are mentioned. We want to analyze the locations of the banks that are issuing all of these credit cards. Most of the PDFs contain one or more mailing addresses, presumably corresponding to where the bank is located.
Your states.py program should:
- Use a regular expression to look through the text of each PDF page for something that looks like a state abbreviation (two capital letters), followed by a single space, followed by a 5-digit zip code.
  Be sure not to include anything else; for example PO BOX 89909 should not count as a state OX, and NMLSR ID 399801 should not count for Idaho since that’s a 6-digit code.
- Print all the state abbreviations that you find, one per line, with repeats, to a new text file cardstates.txt as you go.
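The two ingredients here are word boundaries (so a match cannot start or end in the middle of a longer word or number) and a capture group (so findall() returns only the part you care about). Below is a minimal sketch of both on a different pattern, course codes like SD212, so the state regex is still yours to work out. The sample text is made up, and io.StringIO stands in for a file opened with open(..., 'w').

```python
import io
import re

# Two capital letters followed by exactly three digits, as a whole word.
# The parentheses capture only the letters.
COURSE = re.compile(r'\b([A-Z]{2})\d{3}\b')

text = 'Take SD212 after SD211. XSD2120 and sd212 should not match.'

out = io.StringIO()   # stand-in for an output file handle
for prefix in COURSE.findall(text):
    # One captured prefix per line, repeats included.
    print(prefix, file=out)

print(out.getvalue())
```

When a pattern contains exactly one group, findall() returns a list of just the captured strings rather than the full matches.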
After running your states.py program, you should have a cardstates.txt file. Now count how many times Oregon (OR) appeared. For the 1% dataset, you should get 14. You can easily run a single bash command line to answer this. Then run your program on the 10% dataset.
Now fill in this question in the markdown file:
- For the 10% dataset, how many times does the state OR appear?
3.6 Submit what you have so far
Save your files and submit your work so far:
submit -c=sd212 -p=lab04 lab04.md fees.py states.py cardstates.txt
or
club -csd212 -plab04 lab04.md fees.py states.py cardstates.txt
or use the web interface
Part 3: Population control (30 pts)
We want to compare the prevalence of each state in the credit card agreements with the population of that state. For that, we first need another piece of data, the population of each state.
Bottom line: The goal of this part is to write a program tally.py
that produces (and prints) a Pandas DataFrame with at least three
columns: the state abbreviation, state population, and count of how
many times that state was mentioned in the credit card agreement PDFs.
Below I have some steps and suggestions of how to make this DataFrame, but you are encouraged to stop now and try to figure it out your own way!
3.7 Population Data
We have provided you with a file pops.csv in your folder for this lab.
This was created for you from
Census data.
Important: You need the state populations to be listed by abbreviation, to match up with the cardstates.txt file you already have. This CSV has a column with the state abbreviations.
3.8 Tally up
Create a new python program tally.py. This program should read the big
list of state names in your cardstates.txt from the previous part, and
turn this into a Pandas DataFrame that has two columns, one for the
state names, and one for the count of how many times that state appeared
in the CC agreements.
BUT! This is a .txt file, not a .csv file. If you read it with pandas as a CSV, pandas will not find a header line that gives the column names, so it will assume the first row is the header. Read the following tips on how to handle this.
- read_csv
  You know this function well, but since cardstates.txt is just a list of state names, it’s like a single-column csv file. There is no header name for the single column. You can specify an option like names=['state'] to tell Pandas there is no header and to give it a nice name:
  df = pd.read_csv('cardstates.txt', names=['state'])
- value_counts
  Call this function on a Pandas series to combine entries with the same name and get how many times each label appears in that series.
  What gets returned is a new, smaller series, which is indexed by the original series values, and where the series now contains integers counting up how many times each thing occurs.
- reset_index
  This one is new. What it does is take a single-column Pandas Series and turn it into a two-column DataFrame, where the first column is the index from the series.
  Using reset_index after a call to value_counts can be especially useful!
Your goal here is to get a Pandas DataFrame with a column for the state name and a column for how many times that state appeared in cardstates.txt.
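Here is a minimal sketch of how these three functions can chain together, using a made-up six-line list in place of your real cardstates.txt. Two details are additions that the tips above do not cover: rename_axis('state') and reset_index(name='count'). They are there only to pin down the final column names, because the default names produced by value_counts followed by reset_index differ between pandas versions.

```python
import io

import pandas as pd

# Stand-in for cardstates.txt: one abbreviation per line, no header line.
fake_file = io.StringIO('DE\nSD\nDE\nUT\nDE\nSD\n')

# names= tells pandas there is no header and names the single column.
df = pd.read_csv(fake_file, names=['state'])

counts = (
    df['state']
    .value_counts()              # Series: index = state, values = tallies
    .rename_axis('state')        # make sure the index is called 'state'
    .reset_index(name='count')   # back to a two-column DataFrame
)

print(counts)
```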
3.9 Combine
Now it’s time to add the logic to your tally.py program so that it
combines the state population data with the counts from the credit card
agreements.
As usual, there are many possible ways to do this! Here’s one way:
At this point you should have a DataFrame with state abbreviations and populations, and a completely separate DataFrame that just has the state names and counts.
We want to combine these and “match up” the rows that correspond to the same state. The first thing to do is to get the columns for the state abbreviations to have the same column name in both DataFrames.
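You will need to use the rename method to change your column names so the state column has the same name in both DataFrames. It’s easy, but remember that rename returns a new DataFrame, so save the result:
df = df.rename(columns={'Old Name': 'new name'})
Next we want to use the Pandas merge operation.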
Because this is a really sophisticated function (which we will learn more about later in the semester), I’m going to show you exactly how you want to use it, via a small illustrative example:

Make sure you do a .fillna(0) to change the NaNs, for states that didn’t have any credit cards, into zero counts as they should be.
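A minimal sketch of the rename, merge, and fillna steps on toy data follows. Everything in it is made up for illustration: the states AA, BB, CC, their populations, and the column name Abbrev (check what your pops.csv actually calls its columns). The how='left' choice is one reasonable option, assuming you want to keep every state from the population table even when it has no matches in the counts.

```python
import pandas as pd

# Toy population table with a differently named state column.
pops = pd.DataFrame({
    'Abbrev': ['AA', 'BB', 'CC'],
    'population': [1000, 2000, 3000],
})

# Toy tally table; note that CC never appears.
counts = pd.DataFrame({
    'state': ['AA', 'BB'],
    'count': [5, 1],
})

# Step 1: give the state column the same name in both DataFrames.
pops = pops.rename(columns={'Abbrev': 'state'})

# Step 2: match rows by state, keeping every row of pops.
merged = pops.merge(counts, on='state', how='left')

# Step 3: states with no match get NaN; turn those into zero counts.
merged['count'] = merged['count'].fillna(0).astype(int)

print(merged)
```

The astype(int) at the end is optional. It is there because a column that held NaN is stored as floating point, and whole-number counts read more naturally as integers.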
3.10 Question
Use your DataFrame to do a very simple analysis and answer this question:
- For the 100% data, which state has the highest ratio of credit card agreements per population? Write just the 2-letter abbreviation of the state.
3.11 Submit what you have so far
Save your files and submit your work so far:
submit -c=sd212 -p=lab04 lab04.md fees.py states.py cardstates.txt tally.py
or
club -csd212 -plab04 lab04.md fees.py states.py cardstates.txt tally.py
or use the web interface
Part 4: Graph it (10 pts)
Modify your tally.py program so that it displays a scatterplot with
population on the x-axis and the credit card agreement count on the
y-axis. So each state should appear as a single dot, and (for example) a state on the bottom-right would have a high population and a low number of credit card companies.
Making the scatterplot using Plotly Express should be very simple if
you have your DataFrame worked out from the last part.
The second example on
this page
shows how to make a scatterplot from a dataframe and is a great starting
point.
A more complete description of the px.scatter() function is
here.
Add a trendline to your scatterplot (look at the documentation pages above to find the right option). The trendline will show sort of the “average” relationship between state population and credit card agreements.
Once you are happy with how your graph looks, save it to a file
called scatter.png. Then answer two final questions:
- What are the most significant outliers in the dataset? In your graph, these would be the states that are farthest from the trendline. What states are way above or below the “typical” population-scaled average?
- Choosing one outlier state you identified in the previous question, try to do some quick research to make a plausible explanation of why that state has so many or so few credit card companies relative to its population.
  (For example, some states have different tax laws, lending regulations, or court systems that may make them more or less attractive to the credit card companies.)
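If you want to back up what your eyes tell you, "farthest from the trendline" can be made concrete as a residual: the actual count minus the count the line predicts. The sketch below assumes the trendline is an ordinary least-squares straight line (check the Plotly documentation for what the option you chose actually fits) and uses five made-up states with made-up numbers, not real data.

```python
# Made-up (population, count) pairs for five hypothetical states.
data = {
    'A': (1, 2),
    'B': (2, 1),
    'C': (3, 3),
    'D': (4, 4),
    'E': (5, 20),
}

xs = [pop for pop, _ in data.values()]
ys = [cnt for _, cnt in data.values()]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Ordinary least-squares fit of y = slope * x + intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Residual = actual count minus the count predicted by the line.
residuals = {
    state: cnt - (slope * pop + intercept)
    for state, (pop, cnt) in data.items()
}

# Rank states by distance from the line, largest first.
ranked = sorted(residuals, key=lambda s: abs(residuals[s]), reverse=True)
farthest = ranked[0]

print(farthest, round(residuals[farthest], 2))
```

A positive residual means the state sits above the line (more agreements than its population would suggest), and a negative one means it sits below.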
3.12 Submit your work
Save your files and submit your work:
submit -c=sd212 -p=lab04 lab04.md fees.py states.py cardstates.txt tally.py scatter.png
or
club -csd212 -plab04 lab04.md fees.py states.py cardstates.txt tally.py scatter.png
or use the web interface