Reconstructing Shredded Documents

Introduction

In a brief article in the Technisch Weekblad I read that a team had won the DARPA Shredder Challenge, but that their technology was kept secret. That surprised me: I recalled watching an episode of The Lone Gunmen (the hacker guys from The X-Files) in which they reconstructed shredded paper with a computer program. Contrary to most movie computer stuff, I had considered this plausible and assumed that it was a more or less solved problem. Apparently not. I started thinking about how one would solve it, and after a few days I really wanted to try it out. This page describes the results of my humble experiment.

Overview

This is the approach of my algorithm; the rest of this text roughly follows these steps.

  1. Scan shreds and feed image to algorithm
  2. Detect shreds, polygonise
  3. Detect 'features' on shred edges
  4. Assign a score to pairs of edges using features
  5. Cluster shreds, maximising the score

Scanning

I wrote a text on a piece of paper and cut it into twelve pieces along the lines. The few paper shredders I have seen produce long, thin shreds, so that is what I cut as well.

[Figure: scan of the cut shreds]

For my own convenience, I numbered the shreds before cutting. My scanner projected shadows onto the blue background paper, so I cleaned that up and replaced the noisy blue background with a constant colour for easy snippet/background segmentation. The following image is the input to my algorithm:

[Figure: input image to the algorithm]
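
Because the background was replaced by one constant colour, separating shreds from background boils down to a per-pixel colour test. A minimal sketch in Python (not my actual code; an RGB scan and a pure-blue background colour are both assumptions):

    import numpy as np

    # Assumed constant background colour (pure blue). In practice this is
    # whatever colour the background was repainted with.
    BACKGROUND = np.array([0, 0, 255], dtype=np.uint8)

    def shred_mask(image):
        """Boolean mask: True on shred pixels, False on background pixels."""
        # A pixel belongs to a shred if it differs from the background colour
        # in any channel.
        return np.any(image != BACKGROUND, axis=-1)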

Creating Polygons

Polygons are created that exactly follow the outlines of the shreds. I initially simplified the polygons before feature detection, but this created problems, so I disabled that step.

[Figure: polygon outlines of the shreds]

The edges of the polygons are drawn in bright green; the vertices are marked with pink crosses.
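
One way to obtain such pixel-exact outlines is contour tracing on the shred mask, for instance with OpenCV. The sketch below is not my actual code, and the minimum-length filter is an arbitrary choice:

    import cv2
    import numpy as np

    def shred_polygons(mask):
        """Trace one pixel-exact outline polygon per shred."""
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL,      # outer outline only
                                       cv2.CHAIN_APPROX_NONE)  # keep every boundary pixel
        # Each contour is an (N, 1, 2) array of (x, y) points; flatten to (N, 2)
        # and drop tiny specks that are clearly not shreds.
        return [c.reshape(-1, 2) for c in contours if len(c) > 50]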

Detecting Features

Features are detected by walking the edges of the polygons. When the colour changes from white to black, a feature starts; when the colour changes back to white, the feature ends. The centre (the average of the start and end positions) and the averaged shred normal describe a feature. I considered adding size, colour and gradient (or another approximation of the shape), but position and shred normal proved to be sufficient. For cases where fewer features are available on the shreds, though, I believe adding more feature information would be needed. This is the result of the feature detection:

[Figure: detected features on the shred outlines]

The outlines are coloured red when on a feature, and blue otherwise. The circles denote centres of detected features.
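
A minimal sketch of this walk (not my actual code; the is_black sampling function is assumed to look at the scan just inside the outline, and the normal is averaged over the two endpoints only as a simplification):

    import numpy as np

    def detect_features(polygon, is_black):
        """Walk the outline and record (centre, normal) for every black run."""
        def normal_at(i):
            # Perpendicular to the local outline direction: the shred normal.
            d = polygon[(i + 1) % len(polygon)] - polygon[i - 1]
            n = np.array([-d[1], d[0]], dtype=float)
            return n / (np.linalg.norm(n) + 1e-9)

        features, start = [], None
        for i, point in enumerate(polygon):
            if is_black(point):
                if start is None:
                    start = i                      # white -> black: a feature starts
            elif start is not None:                # black -> white: the feature ends
                centre = (polygon[start] + point) / 2.0
                normal = normal_at(start) + normal_at(i)
                normal /= np.linalg.norm(normal) + 1e-9
                features.append((centre, normal))
                start = None
        # A black run that wraps past the last outline point is ignored here.
        return features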

Matching Features

First, the polygons are simplified to speed up further calculations. For each pair of shreds (Sa, Sb) I try all pairs of feature pairs (Fa0, Fa1), (Fb0, Fb1) and align the two shreds so that Fb0 lies on Fa0 and the line Fb0--Fb1 matches Fa0--Fa1. Initially, I used a heuristic: for each pair of features on snippet Sa, find the pair of features on Sb whose mutual distance is as close to equal as possible, and align those two pairs. This approach yielded many nearly equidistant feature pairs, but none of them led to valid alignments. I disabled it and simply brute-forced through all alignments.
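
The alignment itself is a rigid transform that drops Fb0 onto Fa0 and rotates Sb so that the direction Fb0--Fb1 coincides with Fa0--Fa1. A minimal sketch (not my actual code):

    import numpy as np

    def align(fa0, fa1, fb0, fb1):
        """Rotation and translation that put fb0 on fa0 and turn fb0->fb1 onto fa0->fa1."""
        da, db = fa1 - fa0, fb1 - fb0
        angle = np.arctan2(da[1], da[0]) - np.arctan2(db[1], db[0])
        c, s = np.cos(angle), np.sin(angle)
        rotation = np.array([[c, -s], [s, c]])
        translation = fa0 - rotation @ fb0
        return rotation, translation

    def transform(points, rotation, translation):
        """Apply the alignment to an (N, 2) array of points (outline or feature centres)."""
        return points @ rotation.T + translation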

When two snippets are aligned, a score is assigned to the alignment. Features are moved along their shred normal (perpendicular to the paper border) and tested to see whether they fall into the other shred (inside the simplified polygon). Only these features are used for the score. For each feature Fa, a feature Fb is sought that maximises a metric. After trying several metrics and approaches (minimising versus maximising) I settled on metric = (1 - 2 * smoothStep(D1, D2, dist)). This metric is 1 for features closer than D1, -1 for features farther away than D2, and changes smoothly from 1 to -1 for distances in [D1, D2]. The metric does not even use the paper normals, but adding a dot product of the normals would probably help. The final score for two shreds is the sum of all feature-match results. I considered calculating the overlap of the two shreds (the intersection of the two polygons), but this was not necessary.
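
A minimal sketch of the metric and the per-alignment score (not my actual code; the D1 and D2 values are made-up placeholders, and the test for features falling inside the other shred is omitted):

    import numpy as np

    # Distance thresholds in pixels; placeholders, not the values I used.
    D1, D2 = 2.0, 10.0

    def smooth_step(d1, d2, x):
        """Hermite smoothstep: 0 below d1, 1 above d2, smooth in between."""
        t = np.clip((x - d1) / (d2 - d1), 0.0, 1.0)
        return t * t * (3.0 - 2.0 * t)

    def feature_match_score(dist):
        """1 for features closer than D1, -1 beyond D2, smooth in between."""
        return 1.0 - 2.0 * smooth_step(D1, D2, dist)

    def alignment_score(centres_a, centres_b):
        """Sum, over the features of one shred, of the best match on the other."""
        return sum(max((feature_match_score(np.linalg.norm(fa - fb)) for fb in centres_b),
                       default=-1.0)
                   for fa in centres_a)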

Solving

After all possible alignments for each shred pair are computed, we can solve. First, the pair with the best possible alignment (highest score) is sought; these two shreds form the seed. The features used to combine the two are added to a list of used features. Next, another shred is sought that connects to any shred in the seed without using previously used features. The shred with the highest score is added, et cetera. I did not expect this to work directly, i.e. I considered it likely that a wrong alignment would appear somewhere, ruining the total solution. So I was prepared to assign a score to the total solution (or ask for human assistance) and try several trees of solutions. Fortunately, this was not needed: the first iteration yielded the correct solution, and I didn't bother to write more code.
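
A minimal sketch of this greedy assembly (not my actual code; representing each scored alignment as a (score, shred_a, shred_b, features_used) tuple is an assumption):

    def grow_solution(shreds, alignments):
        """Greedily grow a cluster of shreds, starting from the best-scoring pair."""
        alignments = sorted(alignments, key=lambda a: a[0], reverse=True)
        # Seed: the best-scoring pair overall.
        seed = alignments[0]
        placed = {seed[1], seed[2]}
        used_features = set(seed[3])
        solution = [seed]
        # Repeatedly attach the best-scoring shred that connects to the cluster
        # without re-using a feature that already glued two shreds together.
        while len(placed) < len(shreds):
            candidate = next((al for al in alignments
                              if (al[1] in placed) != (al[2] in placed)  # attaches one new shred
                              and not used_features & set(al[3])),       # no reused features
                             None)
            if candidate is None:
                break                                  # nothing attaches any more
            placed.update((candidate[1], candidate[2]))
            used_features.update(candidate[3])
            solution.append(candidate)
        return solution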

[Figure: the reconstructed document]

The solution, with debug output artefacts. The translation of the text is: "this will be the easier TEST, :) I hope at least... Roel".

Discussion and Future Work (not by me)

My approach appeared to work, for my very limited single test case. When I looked at the puzzles DARPA provided, their shreds had different properties than the shreds I had in mind while coding and testing: short snippets with a few features rather than my long strips with many features. If I had participated in the DARPA challenge (which I probably would have if I had heard of it before it closed, and if I were a U.S. resident), I would have used the shape of the shreds too, and would have extracted more information from the features, as there are fewer of them. For real-world cases, where shreds that are not part of the solution must also be considered, my algorithm would not work in its current state. I didn't bother to enhance my approach or test other cases, as I have already spent enough time behind the keyboard.

I hope that my write-up inspires people to create awesome FOSS generic unshredders. I decided not to provide the source code, as it is nothing more than a representation of the ideas presented here, but in an obfuscated form. If you have questions, ask me.


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.