A PDF Harvester in 25 Lines of Python

Published February 20, 2024

The goal of this article is to develop a utility that handles the following:

  1. Retrieve HTML from a webpage.
  2. Parse the HTML and extract all references to embedded PDF links.
  3. For each PDF link, download the document and save it locally.

Plenty of third-party libraries can query and retrieve a webpage’s links. The purpose of this post, however, is to show that by combining elements of the Python Standard Library with the Requests package, we can roll our own and learn something while we’re at it.

Step I: Acquire HTML

This is straightforward using requests. Let’s query the Singular Value Decomposition page on Wikipedia:

import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"

# .text returns the response body decoded to a string.
html = requests.get(url).text

html[:50]
'<!DOCTYPE html>\n<html class="client-nojs vector-fe'

The HTML has been obtained. Next we’ll identify and extract references to all embedded PDF links.

Step II: Extract PDF URLs from HTML

A cursory review of the HTML from webpages with embedded PDF links revealed the following:

  • Valid PDF URLs will almost always be embedded within an href attribute.
  • Valid PDF URLs will in all cases begin with http or https.
  • Valid PDF URLs will in all cases be followed by a closing >.
  • Valid PDF URLs cannot contain whitespace.

After some trial and error, the following regular expression was found to have acceptable performance for our test cases:

"(?=href=).*(https?://\S+.pdf).*?>"

An excellent site to practice building and testing regular expressions is Pythex. The app lets you construct regular expressions and see how they match against the target text. I find myself using it on a regular basis.
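
For a quick offline check, the pattern can also be run against a small hand-written fragment. The markup and URLs below are made up for illustration. The sketch also shows one limitation worth knowing about: because .* is greedy and . does not match newlines, a line of HTML containing several PDF links typically yields only the last one.

```python
import re

PDF_LINK = re.compile(r"(?=href=).*(https?://\S+.pdf).*?>")

# Hypothetical markup: two PDF links and one ordinary page link.
sample_html = """\
<a href="https://example.com/docs/intro.pdf">Intro</a>
<a href="https://example.com/about.html">About</a>
<a class="external" href="http://example.org/papers/svd.pdf" rel="nofollow">SVD</a>
"""

matches = PDF_LINK.findall(sample_html)
for link in matches:
    print(link)
# https://example.com/docs/intro.pdf
# http://example.org/papers/svd.pdf

# Two PDF links on a single line: the greedy .* skips ahead to the last one.
one_line = '<a href="http://a.example/x.pdf">x</a> <a href="http://b.example/y.pdf">y</a>'
print(PDF_LINK.findall(one_line))
# ['http://b.example/y.pdf']
```

Note also that the . before pdf is unescaped, so it matches any character rather than only a literal dot. Writing \.pdf would be stricter, at the cost of deviating from the pattern used in this post.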

Here is the logic associated with steps I and II combined:


import re
import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"

# .text returns the response body decoded to a string.
html = requests.get(url).text

# Search html and compile PDF URLs in a list.
pdf_links = re.findall(r"(?=href=).*(https?://\S+.pdf).*?>", html)

for link in pdf_links:
    print(link)
http://www.wou.edu/~beavers/Talks/Willamette1106.pdf
http://www.alterlab.org/research/highlights/pone.0078913_Highlight.pdf
http://math.mit.edu/~edelman/publications/distribution_of_a_scaled.pdf
http://files.grouplens.org/papers/webKDD00.pdf
https://stanford.edu/~rezab/papers/dimsum.pdf
http://faculty.missouri.edu/uhlmannj/UC-SIMAX-Final.pdf

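Note that the regular expression is prefixed with an r when passed to re.findall. This tells Python to treat what follows as a raw string: backslashes are kept as literal characters instead of being processed as Python escape sequences, so sequences such as \S reach the regular expression engine unchanged.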
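
The effect of the r prefix is easy to see in isolation. The short sketch below uses only the standard library; the variable names are illustrative.

```python
import re

newline = "\n"   # one character: a line feed
raw = r"\n"      # two characters: a backslash followed by "n"
print(len(newline), len(raw))
# 1 2

# Without the r prefix, each backslash has to be doubled to survive.
print(r"\S+" == "\\S+")
# True

# Both spellings hand the same pattern to the regex engine.
print(re.findall(r"\S+", "two words") == re.findall("\\S+", "two words"))
# True
```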

re.findall returns a list of matches extracted from the source text. In our case, it returns a list of URLs referencing the PDF documents found on the page.
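
The reason we get bare URLs back, rather than the full href=...> match, is that the pattern contains exactly one capture group. With no groups, re.findall returns whole matches; with one group, it returns only the captured text; with several groups, it returns tuples. A small sketch on a made-up string (the patterns here are simplified for illustration and are not the one used in this post):

```python
import re

text = 'href="https://example.com/a.pdf" href="https://example.com/b.pdf"'

# No capture group: each result is the entire match.
whole = re.findall(r'href="\S+\.pdf"', text)

# One capture group: each result is only the captured text.
captured = re.findall(r'href="(\S+\.pdf)"', text)

print(whole)
# ['href="https://example.com/a.pdf"', 'href="https://example.com/b.pdf"']
print(captured)
# ['https://example.com/a.pdf', 'https://example.com/b.pdf']
```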

For the last step, we need to retrieve the documents associated with our collection of links and write them to file locally. We introduce another module from the Python Standard Library, os.path, which splits file paths into their components. Applied to a URL, it lets us retain the original filename when saving each document.

For example, consider the following URL:

https://stanford.edu/~rezab/papers/dimsum.pdf

To capture dimsum.pdf, we pass the absolute URL to os.path.split, which returns a tuple of everything preceding the filename as the first element, along with the filename and extension as the second element:


import os

url = "https://stanford.edu/~rezab/papers/dimsum.pdf"
os.path.split(url)
('https://stanford.edu/~rezab/papers', 'dimsum.pdf')

This will be used to preserve the filename of the documents we save locally.
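
os.path.split is sufficient here because our pattern only captures URLs that end in pdf. For URLs that come from elsewhere and may carry a query string or percent-encoded characters, a URL-aware variant is one option. The sketch below uses urllib.parse and posixpath from the standard library; filename_from_url is a hypothetical helper name, and the second URL is made up.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of `url`, ignoring query and fragment."""
    path = urlsplit(url).path
    # Decode percent-escapes such as %20 after taking the final segment.
    return unquote(posixpath.basename(path))


print(filename_from_url("https://stanford.edu/~rezab/papers/dimsum.pdf"))
# dimsum.pdf
print(filename_from_url("https://example.com/files/report%20final.pdf?download=1"))
# report final.pdf
```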

Step III: Write PDFs to File

This step differs from the initial HTML retrieval in that we need to request the content as bytes, not text. By calling requests.get(url).content, we’re accessing the raw bytes that comprise the PDF, then writing those bytes to file. Here’s the logic for the third and final step:

import os
import re
import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"
html = requests.get(url).text
pdf_links = re.findall(r"(?=href=).*(https?://\S+.pdf).*?>", html)


# Request PDF content and write to file for all entries.
for pdf in pdf_links:

    # Get filename from url for naming file locally.
    pdf_name = os.path.split(pdf)[1].strip()
    
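    try:
        r = requests.get(pdf).content
        with open(pdf_name, "wb") as f:
            f.write(r)
    except (requests.RequestException, OSError):
        # Network failures and file-system errors; anything else propagates.
        print(f"Unable to download {pdf_name}.")
    else:
        print(f"Saved {pdf_name}.")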
Saved Willamette1106.pdf.
Saved pone.0078913_Highlight.pdf.
Saved distribution_of_a_scaled.pdf.
Saved webKDD00.pdf.
Saved dimsum.pdf.
Unable to download UC-SIMAX-Final.pdf.

Notice that we wrap both the request and the with open(pdf_name, "wb") block in a try-except. This handles situations that would prevent our code from downloading a document, such as broken redirects or invalid links.
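
One case the try-except does not cover: by default, requests.get returns normally for HTTP error codes such as 404, so an HTML error page can end up on disk under a .pdf name. Calling raise_for_status() on the response turns those codes into exceptions. As a complementary check, the sketch below inspects the leading bytes of the payload before writing, relying on the %PDF- signature that PDF files start with. The helper name save_if_pdf is hypothetical, and the payloads are in-memory stand-ins rather than real downloads.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"


def save_if_pdf(data, path):
    """Write `data` to `path` in binary mode if it looks like a PDF.

    Returns True when the file was written, False otherwise.
    """
    if not data.startswith(PDF_MAGIC):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True


# Demonstration with in-memory payloads instead of real downloads.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    print(save_if_pdf(b"%PDF-1.7\n...", good), os.path.exists(good))
    # True True
    print(save_if_pdf(b"<!DOCTYPE html><title>404</title>", bad), os.path.exists(bad))
    # False False
```

In the download loop, this could be called as save_if_pdf(requests.get(pdf).content, pdf_name).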

All in, we end up with 16 lines of code, excluding comments. Next, we present the full implementation of the PDF harvester after a little reorganization:


import os.path
import re
import requests


def pdf_harvester(url):
    """
    Retrieve the HTML at `url` and extract references to PDFs. Download
    each PDF, writing it to the current working directory.

    Parameters
    ----------
    url: str
        Web address to search for PDF links.
    """
    html = requests.get(url).text
    pdf_links = re.findall(r"(?=href=).*(https?://\S+.pdf).*?>", html)

    for pdf in pdf_links:
        
        # Get filename from url for naming file locally.
        pdf_name = os.path.split(pdf)[1].strip()

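        try:
            r = requests.get(pdf).content
            with open(pdf_name, "wb") as f:
                f.write(r)
        except (requests.RequestException, OSError):
            # Network failures and file-system errors; anything else propagates.
            print(f"Unable to download {pdf_name}.")
        else:
            print(f"Saved {pdf_name}.")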