Extract Text from Your PDF and Image Files with Apache Tika
With AI showing up in more and more products, and retrieval-augmented generation (RAG) becoming a common requirement in organizations, how you provide context to your AI tools matters.
Some of the more popular subscription-based tools like ChatGPT accept images and files in their prompts and can decipher what's in those files, but many local models and tools can only work with plaintext.
I've run into a few scenarios of my own where I needed to work with PDFs in my AI tools, including:
- The LiveKit voice agent demo I created that accepts resumes and job descriptions in any document format.
- The self-hosted Open WebUI chat tool that lets the user create a knowledge base from various document formats.
So how do we make it work?
Take a look at Apache Tika, an open source tool that extracts metadata and text from popular file formats and returns it as plaintext. In this short tutorial, we're going to see how to deploy Apache Tika and watch it work its magic.
Deploying an Apache Tika Container with Docker
While you don't need Docker to use Apache Tika, it is by far the easiest way to get started because the service can be deployed with a single command. You don't even need to worry about volume mappings to make it work.
Apache Tika ships two image variants. The standard image handles most document formats like PDF, DOCX, and PPTX out of the box. If you also need to extract text from image files like PNG or JPG, you'll want the -full image, which bundles Tesseract OCR:
| Use case | Image |
|---|---|
| PDFs and office documents | apache/tika:latest |
| Images (PNG, JPG, TIFF, etc.) | apache/tika:latest-full |
You can execute the following with Docker installed:
docker run -d -p 9998:9998 apache/tika:latest-full
Unless you plan to launch your other services on the same container network, it's important to expose the port on the host.
If you'd rather use Docker Compose, you can create a docker-compose.yml file like the following:
services:
  tika:
    image: apache/tika:latest-full
    ports:
      - "9998:9998"
To use the docker-compose.yml file, run docker compose up from your command line.
There's nothing more to it. Your Apache Tika service is ready for use.
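If another service or script depends on Tika, it can help to confirm that the container is actually answering before sending documents. The following is a minimal sketch, assuming the server is published on localhost:9998 as above and that it reports its version at the /version endpoint. The function names fetch_version and wait_for_tika are made up for this example and are not part of Apache Tika.

```python
import time
import urllib.request

# Assumption: the port published by the docker run command above.
TIKA_URL = "http://localhost:9998"


def fetch_version(base_url: str, timeout: float = 2.0) -> str:
    """Ask the Tika server for its version string via GET /version."""
    with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def wait_for_tika(base_url: str = TIKA_URL, attempts: int = 10,
                  delay: float = 1.0, fetch=fetch_version) -> str:
    """Poll the server until it answers, then return its version string.

    `fetch` is injectable so the retry logic can be exercised without a server.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except OSError as error:  # covers refused connections and timeouts
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"Tika did not respond at {base_url}") from last_error
```

Calling wait_for_tika() right after starting the container should return the version string once the server has finished booting, or raise an error after the configured number of attempts.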
Extract Text from Documents and Other File Formats
Apache Tika has a lot of features. You can get an idea of what's possible by navigating to localhost:9998 in your web browser. We're going to stick to the basics: send a document and get text back.
From a command line you can execute something like this:
curl --request PUT \
--url http://localhost:9998/tika \
--header 'content-type: application/octet-stream' \
--data-binary "@/PATH/TO/MY/FILE.pdf"
Provide a proper file path and Apache Tika service URL and you should end up with an XML response in return. This works the same way for image files, as long as you're running the -full image. Just swap in the path to your PNG, JPG, or TIFF and Tika will run OCR on it and return the extracted text. If you're using the response with an LLM, the XML format should be fine, but you can add the following accept header to return the content as plaintext:
curl --request PUT \
--url http://localhost:9998/tika \
--header 'accept: text/plain' \
--header 'content-type: application/octet-stream' \
--data-binary "@/PATH/TO/MY/FILE.pdf"
As I mentioned earlier, I have been using Apache Tika with Open WebUI. Within the tool you can add Apache Tika in the Admin -> Settings -> Documents section. This will allow you to upload documents in the Workspace -> Knowledge section to be used for RAG.
You can also check out how I used Apache Tika in an AI Voice Agent Interview Coach.
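If you'd rather call Tika from your own application code than from curl, the same PUT request can be sent with Python's standard library. This is a rough sketch that mirrors the curl examples above, not the code used in either of the projects mentioned. It assumes the service is reachable at http://localhost:9998/tika, and the names build_tika_request and extract_text are made up for illustration.

```python
import urllib.request

# Assumption: same host, port, and path as the curl examples.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, plain_text: bool = True,
                       url: str = TIKA_URL) -> urllib.request.Request:
    """Build the PUT request from the curl examples without sending it."""
    headers = {"Content-Type": "application/octet-stream"}
    if plain_text:
        # Ask for plaintext; without this header the server picks its default format.
        headers["Accept"] = "text/plain"
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


def extract_text(path: str, plain_text: bool = True,
                 url: str = TIKA_URL, timeout: float = 120.0) -> str:
    """Send a local file (PDF, DOCX, PNG, ...) to Tika and return the response body."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), plain_text, url)
    # The generous timeout is a guess; OCR on large images can take a while.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the container running, extract_text("/PATH/TO/MY/FILE.pdf") should return the document's text, which you can then pass along as context to your model. Building the request in a separate function keeps the headers easy to inspect or adjust.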
Conclusion
You just saw how to quickly get started with Apache Tika. While Docker isn't the only way to use Apache Tika, it is definitely an easy way. Once you have an Apache Tika service available, you can start sending documents to it over HTTP in exchange for XML or plaintext responses.

Nic Raboy
Nic Raboy is an advocate of modern web and mobile development technologies. He has experience in C#, JavaScript, Golang and a variety of frameworks such as Angular, NativeScript, and Unity. Nic writes about his development experiences related to making web and mobile development easier to understand.