
Scripted find and replace in PDF

Every month my SO needs to redact a piece of personally identifiable information from a few PDFs. If you've ever tried editing a PDF without a commercial PDF editor, you probably know it's far from straightforward.

LibreOffice Draw works quite well. But why not spend a day or two automating a task that otherwise takes maybe 20 minutes to do with Draw?

The goal: find a way to run a find and replace operation on a few different PDFs via the Windows command line and put the whole process in a script, PowerShell in this case.

To my great fortune, I have shoulders of giants to stand on. Jonathan Scott Enderle was trying to modify PDF text by hand and his findings helped me to solve the toughest part – cracking the PDF to get to something editable.

Jonathan calls the PDF standard Byzantine. After seeing it myself, I'd call it arcane. But I like Byzantine, i.e. overly complex. New word learned.

Decompress the PDF

The first step is to decompress the PDF to make it readable in a text editor. The tool for that is qpdf, a CLI PDF transformer. It can split and merge PDFs, among other things, but for this particular exercise I need it to transform the PDF into a form that is editable with a text editor. This is called uncompressing, or decompressing.

The option for decompressing is --qdf. And according to this StackOverflow answer (via Jonathan's article) it's recommended to add the option --object-streams=disable to create output files with no object streams.

The full command I used to get a decompressed PDF:

qpdf.exe --qdf --object-streams=disable original.pdf decompressed.pdf

Reading the decompressed PDF

Again thanks to Jonathan's article, it didn't take me long to find the right place. Without his explanation, I'd have been quite baffled by the first PDF I decompressed. It uses exactly what Jonathan mentions, a so-called ToUnicode mapping. Each Unicode codepoint is paired with another 4-digit hexadecimal code that the PDF uses internally.

For instance, in my file the Unicode codepoint 0035 (the digit 5) is mapped to 0018 for all fonts used in the PDF.

I didn't look into it, but I half-expect that this mapping varies. Hopefully it stays fixed for a PDF that comes from the same source. Based on Jonathan's findings and Adobe's ToUnicode Mapping File Tutorial I'd assume the mapping will vary from font to font. In my file it doesn't.

My file has mappings for a few font styles, such as Arial in this example:

    ...
/CMapName /Arial def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
72 beginbfchar
<0013> <0030>
<001A> <0037>
<0011> <002E>
<0015> <0032>
<0018> <0035>
    ...

Mappings start after beginbfchar. The left column is the PDF's "internal codepoint", called CID. The right column is the Unicode codepoint. The mappings seem to be in no particular order.

The first few mappings from this example are:

<0013> <0030> # digit zero
<001A> <0037> # digit seven
<0011> <002E> # full stop
<0015> <0032> # digit two
<0018> <0035> # digit five
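Reading the table by eye works for a handful of characters, but the lookup can also be scripted. Below is a minimal Python sketch, not the script I actually used. It assumes the mapping lines look exactly like the excerpt above; the names parse_bfchar and CMAP_EXCERPT are made up for illustration, and the closing endbfchar line, which the excerpt cuts off, is added so the sample is complete.

```python
import re

# Sample copied from the excerpt above; a real CMap would list all 72 entries.
CMAP_EXCERPT = """
72 beginbfchar
<0013> <0030>
<001A> <0037>
<0011> <002E>
<0015> <0032>
<0018> <0035>
endbfchar
"""

# One mapping per line: <CID> <Unicode codepoint>, both as 4 hex digits.
BFCHAR_LINE = re.compile(r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>")


def parse_bfchar(cmap_text):
    """Return a dict mapping each Unicode character to its CID (4 hex digits)."""
    mapping = {}
    in_block = False
    for line in cmap_text.splitlines():
        stripped = line.strip()
        if stripped.endswith("beginbfchar"):
            in_block = True
        elif stripped == "endbfchar":
            in_block = False
        elif in_block:
            match = BFCHAR_LINE.fullmatch(stripped)
            if match:
                cid, codepoint = match.groups()
                mapping[chr(int(codepoint, 16))] = cid.upper()
    return mapping


char_to_cid = parse_bfchar(CMAP_EXCERPT)
```

With the sample above, char_to_cid["5"] gives "0018" and char_to_cid["."] gives "0011". Only simple one-to-one bfchar lines are handled here; anything fancier in a CMap is ignored.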

Even though the mappings are not listed in order, they seem to follow, at least in my file, the same sequence as Unicode. 0013 is the digit zero, 0014 is the digit one, etc.

I don't need to look up every single digit, just keep counting – in hexadecimal – from the CID for zero. Once I have all the CIDs I can just string them together into one long sequence and search for that in the decompressed PDF.

For instance, if I want to find 567, according to the above mapping that is 00180019001A in CID.
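The counting can be left to a few lines of code. This is a rough Python sketch under the assumption that holds in my file: digits sit in consecutive CIDs starting at 0013 for zero. ZERO_CID and digits_to_cid_hex are names I made up for the example, and another PDF may well start somewhere else.

```python
ZERO_CID = 0x0013  # CID of the digit zero in my file; an assumption for other PDFs


def digits_to_cid_hex(digits, zero_cid=ZERO_CID):
    """Turn a string of ASCII digits into the CID hex sequence to search for."""
    parts = []
    for ch in digits:
        if ch not in "0123456789":
            raise ValueError(f"not an ASCII digit: {ch!r}")
        # Count up from the CID of zero, formatted as 4 upper-case hex digits.
        parts.append(f"{zero_cid + int(ch):04X}")
    return "".join(parts)


needle = digits_to_cid_hex("567")  # "00180019001A"
```

The hex digits come out in upper case, matching the mapping table. If a PDF writes them in lower case, the search would need to account for that.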

The same principle applies to any other characters. I'd just first need to look up a character's Unicode codepoint in the correct table. It's easiest to start with ASCII (basic Latin) characters. Single non-ASCII characters can be guessed from their surroundings, so I wouldn't need to look up every single character.

Replacing the found sequence works on the same principle, just reversed. Build a sequence of Unicode codepoints, then find their equivalents in the CID mapping.

In my case, it is easy: I just need to replace all digits with spaces, effectively "deleting" the digits. In my PDF's mapping 0020 (the codepoint for space) maps to 0003, so I replace the whole found sequence with as many repetitions of 0003 as it has characters.
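As a sketch of that replacement in Python: the function name blank_out and the sample content line are invented for illustration, and 0003 as the CID for space is specific to my file. It works on bytes, since a decompressed PDF still contains binary parts.

```python
SPACE_CID = "0003"  # CID for space in my file; look it up in your own mapping


def blank_out(content, cid_sequence, space_cid=SPACE_CID):
    """Replace every occurrence of cid_sequence with the same number of spaces."""
    if len(cid_sequence) % 4 != 0:
        raise ValueError("a CID sequence must be a multiple of 4 hex digits")
    glyph_count = len(cid_sequence) // 4
    replacement = space_cid * glyph_count
    return content.replace(cid_sequence.encode("ascii"), replacement.encode("ascii"))


# Made-up content line containing 567 encoded as CIDs.
sample = b"<00180019001A> Tj"
redacted = blank_out(sample, "00180019001A")  # b"<000300030003> Tj"
```

Because every CID is 4 hex digits, the replacement is exactly as long as the sequence it replaces.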

Not all PDFs are created equal

I have a few other PDFs where I need the PII removed. I decompressed them the same way as the first one. Fortunately, none of the other PDFs use CID mapping. Instead they have the content in "plain" text. Plain is quite a stretch here, as there is still a lot of encoded stuff. I wouldn't call it markup, just some cryptic PDF markings.

You still cannot read the whole content as one block of text. But short passages of text appear as readable (and searchable!) strings in the decompressed file. Non-ASCII characters are still encoded with what I called PDF markings earlier. For example Tj /C003 7.00 Tf <c4>Tj /C001 7.00 Tf 0.00 Tw seems to represent Ä.

I didn't research this further, as the digits I needed to replace are easily found. I deleted the digits instead of replacing them with spaces as with the first file.
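The guess that <c4> stands for Ä can be checked quickly. The sketch below assumes the font uses a Windows-1252 compatible single-byte encoding, which I haven't verified for these files; decode_hex_string is a made-up helper name.

```python
def decode_hex_string(hex_string, encoding="cp1252"):
    """Decode a PDF hex string such as '<c4>' under an assumed single-byte encoding."""
    return bytes.fromhex(hex_string.strip("<>")).decode(encoding)


decoded = decode_hex_string("<c4>")  # "Ä" if the cp1252 assumption holds
```

If a font in the PDF uses a different encoding, the same byte could stand for another character.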

Re-compress the PDF

Once the editing is done, I just need to re-compress the PDF with qpdf. It takes just two parameters, no additional options needed:

qpdf.exe decompressed-edited.pdf recompressed.pdf

I got a warning with the second type of PDF, the simpler one without CID mapping, something along the lines of content length being mismatched (need to get the actual message). Despite that, all the re-compressed PDFs display fine.

As a quick experiment, I replaced all digits with the same number of zeros, so the content stays the same length (remember, I deleted the digits in the second type of PDF). But the warning when re-compressing was still there. I decided to ignore it, because the redacted PDFs all render just fine.

Going full auto

No automation is complete without a script. Of course I went and wrote a PowerShell script that handles decompression, find and replace, and re-compression for a few different PDFs all in one go.
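My script is PowerShell, but the overall shape is easy to show in a few lines of Python. Treat this as a rough outline and not the real thing: the function names and intermediate file names are invented, qpdf is assumed to be on the PATH, and the edit step is passed in as a function that takes and returns bytes. qpdf exits with code 3 when it only has warnings to report, so the sketch lets that through.

```python
import subprocess
from pathlib import Path

QPDF = "qpdf"  # assumed to be on PATH; on Windows I call qpdf.exe


def qdf_command(source, target):
    """Command line for the decompression step."""
    return [QPDF, "--qdf", "--object-streams=disable", str(source), str(target)]


def recompress_command(source, target):
    """Command line for the re-compression step: just input and output."""
    return [QPDF, str(source), str(target)]


def run_qpdf(command):
    """Run qpdf, tolerating warnings (exit code 3) but not errors."""
    result = subprocess.run(command)
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")


def redact_pdf(source, target, edit, workdir):
    """Decompress source, apply edit (bytes -> bytes), re-compress into target."""
    workdir = Path(workdir)
    decompressed = workdir / "decompressed.pdf"
    edited = workdir / "decompressed-edited.pdf"
    run_qpdf(qdf_command(source, decompressed))
    edited.write_bytes(edit(decompressed.read_bytes()))
    run_qpdf(recompress_command(edited, target))
```

A call could look like redact_pdf("original.pdf", "recompressed.pdf", my_edit, "."), where my_edit is whichever find and replace function fits the type of PDF at hand.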

The script is quite specific to my use case, so I won't put it here in full, but at some point I might make a generalized showcase version and add it here.

These resources were very helpful in making the script: