Automatic Document Classification

I created this project because I always have so many documents to scan and categorize, and I hoped that automating the process would make it simpler and therefore more attractive to do. The manual work of typing out filenames with dates and other document details and clicking through folder structures was often just tedious enough to make the paper pile up.

My goal was to build a Java-based pipeline that could ingest an unorganized folder of PDFs and images, figure out what they were, rename them according to a consistent naming format, and automatically move them into a copy of my existing directory structure. Because these documents are sensitive to me, like bank statements and medical records, sending them to a third-party API like OpenAI was out of the question. The entire analysis had to run locally on my machine.

The architecture I ended up with operates in three sequential phases. First, the document is read using either a local Docker container running Tesseract OCR, which is very fast and cheap, or an Ollama vision model like llama3.2-vision for much higher accuracy on messy scans. I primarily use the vision model because it is much more reliable. Because the extracted text is often chaotic, the second phase passes this raw text to a large language model that acts as a summarizer. Finally, the model takes this clean summary to generate a descriptive filename and evaluates my actual existing directory tree to find the most logical subfolder to move the file into.
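To make the three phases concrete, here is a minimal sketch of how they could be chained in Java. All class, record, and field names are hypothetical and not taken from the actual project. The model calls are passed in as plain functions, so the sketch makes no assumption about how Tesseract or Ollama are invoked.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch of the three-phase flow; not the project's real classes.
final class DocumentPipeline {

    /** Outcome of the last phase: the generated filename and the chosen subfolder. */
    record Placement(String filename, String folder) {}

    private final Function<byte[], String> textExtractor;     // phase 1: OCR or vision model
    private final Function<String, String> summarizer;        // phase 2: raw text -> clean summary
    private final Function<String, String> namer;             // phase 3a: summary -> filename
    private final BiFunction<String, String, String> sorter;  // phase 3b: (filename, directory tree) -> folder

    DocumentPipeline(Function<byte[], String> textExtractor,
                     Function<String, String> summarizer,
                     Function<String, String> namer,
                     BiFunction<String, String, String> sorter) {
        this.textExtractor = textExtractor;
        this.summarizer = summarizer;
        this.namer = namer;
        this.sorter = sorter;
    }

    /** Runs all phases for one scanned document, each step feeding the next. */
    Placement process(byte[] scan, String directoryTreeJson) {
        String rawText = textExtractor.apply(scan);
        String summary = summarizer.apply(rawText);
        String filename = namer.apply(summary);
        String folder = sorter.apply(filename, directoryTreeJson);
        return new Placement(filename, folder);
    }
}
```

Keeping each phase behind its own function also makes it easy to swap Tesseract for the vision model without touching the rest of the flow.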

A process overview diagram of the different phases various document types go through.

When I first started developing this, I tried doing it all in a single prompt, but the results were just not accurate. The model would hallucinate dates, ignore my naming conventions, or completely fail to understand the folder hierarchy. Breaking the prompts down step by step forces the AI to "think" properly. Here is an example where the deepseek-r1 model evaluates the directory tree to sort a document into a folder:

SYSTEM:
Sort a document into an existing directory structure WITHOUT creating new directories unless absolutely unavoidable and return a JSON object with the path you want to place the file in.
FILENAME: `2025-05-12 Musterstadt Bürgerbüro Ummeldung Max Mustermann.pdf`
EXISTING_STRUCTURE:
{
  "Offizielles": { ... },
}

MODEL:
<think>
Okay, so I need to figure out where to place this document in the given directory structure. Let's start by looking at all the information provided.
The filename mentions "Ummeldung", which translates to a change of address registration, and "Bürgerbüro", which is a local government office.
The top-level folder "03 Offizielles, Gesundheit" stands out because it includes official documents. Under this, there are subdirectories like "Amt" and "Behörden". 
Since the document is an address registration from a city office, placing the document in "03 Offizielles, Gesundheit/Behörden" makes the most sense because it relates to a standard administrative process.
</think>

{"path":"Offizielles/Behörden"}
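A reply like the one above has to be turned back into a usable path before a file can be moved. The following is a minimal sketch of that step, assuming the reply contains an optional `<think>` block followed by a small JSON object with a `path` field. The class and method names are hypothetical, and the regular expression is a deliberate simplification; a real JSON parser would be more robust.

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class SortReplyParser {

    // (?s) lets '.' match line breaks, so a multi-line reasoning block is removed in one pass.
    private static final Pattern THINK_BLOCK = Pattern.compile("(?s)<think>.*?</think>");

    // Simplified: matches "path":"..." only when the value has no escaped characters.
    private static final Pattern PATH_FIELD =
            Pattern.compile("\"path\"\\s*:\\s*\"([^\"\\\\]*)\"");

    /** Drops the reasoning block and returns the value of the "path" field, if any. */
    static Optional<String> extractPath(String modelReply) {
        String answer = THINK_BLOCK.matcher(modelReply).replaceAll("").trim();
        Matcher m = PATH_FIELD.matcher(answer);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Like extractPath, but yields empty when the folder is not in the known set. */
    static Optional<String> extractExistingPath(String modelReply, Set<String> knownFolders) {
        return extractPath(modelReply).filter(knownFolders::contains);
    }
}
```

Checking the result against the known folders is one way to catch a hallucinated path; what to do with an empty result, such as retrying or leaving the file in place, is up to the caller.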

While the accuracy is great in most cases, the system has a bottleneck on the hardware side. I am running this on my NVIDIA GeForce RTX 3090, but keeping both a vision model and a reasoning model like deepseek-r1:32b loaded in memory at the same time is not possible. So Ollama switches between the two models for every single document, which alone roughly doubles the total processing time. As the output below shows, this results in processing times of roughly 90 to 110 seconds per document when using the vision model. This is an architectural flaw that could be fixed by batching the work by phase: run the OCR step for all files first, switch the model once, and then run the summary and naming steps for the entire batch.

┌--[01 / 01] -----------------------------------------------------------------
| IMG_20250128_0001.pdf
| [ 18.13s] ollama OCR:          1160 chars
| [ 51.89s] Document summarized: 1061 chars
| [ 18.65s] Filename generated:  2024-12-30 Rechnung Deichmann Girocard [...].pdf
| [ 15.93s] Path generated:      Rechnungen
| [104.63s] Moved file:          Rechnungen/2024-12-30 Rechnung Deichmann Girocard [...].pdf
└-----------------------------------------------------------------------------
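The phase-wise batching idea described above could look roughly like the sketch below. It is not implemented in the tool; the class and method names are hypothetical, and the two phases are passed in as functions so that nothing is assumed about how the models are actually called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of batching by phase instead of by document.
final class PhaseBatcher {

    /**
     * Applies the vision phase to every input before the reasoning phase starts,
     * so the work that needs the second model is grouped together at the end.
     */
    static <I, M, O> List<O> runInPhases(List<I> inputs,
                                         Function<I, M> visionPhase,
                                         Function<M, O> reasoningPhase) {
        // Phase 1 for the whole batch: only the vision model is needed here.
        List<M> intermediate = new ArrayList<>();
        for (I input : inputs) {
            intermediate.add(visionPhase.apply(input));
        }
        // Phase 2 for the whole batch: only the reasoning model is needed here.
        List<O> results = new ArrayList<>();
        for (M item : intermediate) {
            results.add(reasoningPhase.apply(item));
        }
        return results;
    }
}
```

With this ordering, the model switch happens once per batch rather than once per document, at the cost of holding the extracted text for the whole batch in memory until the second phase runs.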

I haven't fixed this bottleneck yet, because I always use the tool in an autodetect background mode anyway. The scanner writes the files into a directory, which is then processed automatically. I just let the process run for a while and come back when it's done. Most of the time I don't need the results right away anyway.
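A background mode like this can be built on the JDK's `WatchService`. The following is only a sketch of the general idea, not the tool's actual implementation: the class name, the callback, and the list of accepted file extensions are all assumptions.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical watcher for the scanner's output folder.
final class ScanFolderWatcher {

    // Assumed set of extensions; the post only says "PDFs and images".
    private static final List<String> SUPPORTED = List.of(".pdf", ".png", ".jpg", ".jpeg");

    /** Blocks and hands every newly created, supported file in the inbox to the callback. */
    static void watch(Path inbox, Consumer<Path> onNewFile)
            throws IOException, InterruptedException {
        try (WatchService watcher = inbox.getFileSystem().newWatchService()) {
            inbox.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // waits until at least one event is queued
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; a full rescan would be needed here
                    }
                    Path created = inbox.resolve((Path) event.context());
                    // Note: the file may still be being written when this event arrives.
                    if (isSupported(created)) {
                        onNewFile.accept(created);
                    }
                }
                if (!key.reset()) {
                    break; // the watched directory is no longer accessible
                }
            }
        }
    }

    /** Case-insensitive check of the file extension against the assumed list. */
    static boolean isSupported(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return SUPPORTED.stream().anyMatch(name::endsWith);
    }
}
```

Because the create event can arrive before the scanner has finished writing, a real implementation would probably wait until the file size stops changing before starting the analysis.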

Did it cure my procrastination? Nah, but when I finally do sit down to clear it out, I use this tool every single time, and it makes the process a lot simpler for me.