🪶 Apache Tika Extraction

warning

This tutorial is a community contribution and is not supported by the Open WebUI team. It serves only as a demonstration of how to customize Open WebUI for your specific use case. Want to contribute? Check out the contributing tutorial.

This documentation provides a step-by-step guide to integrating Apache Tika with Open WebUI. Apache Tika is a content analysis toolkit that can be used to detect and extract metadata and text content from over a thousand different file types. All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Prerequisites​

  • Open WebUI instance
  • Docker installed on your system
  • Docker network set up for Open WebUI

Integration Steps​

Step 1: Create a Docker Compose File or Run the Docker Command for Apache Tika​

You have two options to run Apache Tika:

Option 1: Using Docker Compose

Create a new file named docker-compose.yml in the same directory as your Open WebUI instance. Add the following configuration to the file:

services:
  tika:
    image: apache/tika:latest-full
    container_name: tika
    ports:
      - "9998:9998"
    restart: unless-stopped

Run the Docker Compose file using the following command:

docker-compose up -d

Option 2: Using Docker Run Command

Alternatively, you can run Apache Tika using the following Docker command:

docker run -d --name tika \
  -p 9998:9998 \
  --restart unless-stopped \
  apache/tika:latest-full

Note that if you choose to use the Docker run command, you'll need to specify the --network flag if you want to run the container in the same network as your Open WebUI instance.
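As an illustration of the note above, the following Python sketch assembles the same docker run command with an optional --network flag. The helper name tika_run_args and the network name open-webui-net are hypothetical; substitute the name of the Docker network your Open WebUI container actually uses.

```python
import shlex

def tika_run_args(network=None, image="apache/tika:latest-full", host_port=9998):
    """Build the argument list for the `docker run` command that starts Tika.

    `network` is the name of an existing Docker network (hypothetical here).
    When it is None, no --network flag is added and Docker uses its default.
    """
    args = [
        "docker", "run", "-d", "--name", "tika",
        "-p", f"{host_port}:9998",
        "--restart", "unless-stopped",
    ]
    if network:
        args += ["--network", network]
    args.append(image)
    return args

# Print the command so it can be reviewed before it is run by hand.
print(shlex.join(tika_run_args(network="open-webui-net")))
```

The returned list can also be passed directly to subprocess.run(args, check=True) if you prefer to start the container from Python.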

Step 2: Configure Open WebUI to Use Apache Tika​

To use Apache Tika as the content extraction engine in Open WebUI, follow these steps:

  • Log in to your Open WebUI instance.
  • Navigate to the Admin Panel settings menu.
  • Click on Settings.
  • Click on the Documents tab.
  • Change the Default content extraction engine dropdown to Tika.
  • Update the content extraction engine URL to http://tika:9998.
  • Save the changes.

Verifying Apache Tika in Docker

To verify that Apache Tika is working correctly in a Docker environment, you can follow these steps:

1. Start the Apache Tika Docker Container​

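First, ensure that the Apache Tika Docker container is running. If you already started it in Step 1, skip this command, because port 9998 can only be bound once. Otherwise, you can start a container using the following command: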

docker run -p 9998:9998 apache/tika

This command starts the Apache Tika container and maps port 9998 from the container to port 9998 on your local machine.

2. Verify the Server is Running​

You can verify that the Apache Tika server is running by sending a GET request:

curl -X GET http://localhost:9998/tika

This command should return a response similar to the following (recent versions of Tika Server also include the Tika version number in the message):

This is Tika Server. Please PUT
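The same check can be sketched in Python using only the standard library. The function names below are hypothetical, and the greeting test matches only the fixed parts of the message, on the assumption that the exact wording may vary between Tika versions.

```python
import urllib.request

def looks_like_tika_greeting(body: str) -> bool:
    """Return True if `body` resembles the greeting returned by GET /tika."""
    text = body.strip()
    return text.startswith("This is Tika Server") and "Please PUT" in text

def tika_is_up(base_url: str = "http://localhost:9998", timeout: float = 5.0) -> bool:
    """Send GET <base_url>/tika and check the greeting; False on connection errors."""
    try:
        with urllib.request.urlopen(f"{base_url}/tika", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and looks_like_tika_greeting(body)
    except OSError:  # URLError and socket timeouts are subclasses of OSError
        return False
```

Calling tika_is_up() with no arguments probes the server on localhost:9998, matching the curl command above.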

3. Verify the Integration​

You can also test Apache Tika by sending a file for analysis using the curl command:

curl -T test.txt http://localhost:9998/tika

Replace test.txt with the path to a text file on your local machine.

Apache Tika will respond with the content extracted from the file. To retrieve the file's metadata instead, send the same request to the /meta endpoint.
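For comparison, here is a minimal Python sketch of the same upload using only the standard library. The helper names build_tika_put and extract are hypothetical. The sketch assumes Tika Server's /tika endpoint for extracted text and its /meta endpoint for metadata, and it sends a generic Content-Type on the assumption that this leaves format detection to the server.

```python
import urllib.request

def build_tika_put(data: bytes, base_url: str = "http://localhost:9998",
                   endpoint: str = "tika", accept: str = "text/plain"):
    """Build (but do not send) a PUT request carrying raw file bytes."""
    return urllib.request.Request(
        f"{base_url}/{endpoint}",
        data=data,
        method="PUT",
        headers={
            "Accept": accept,
            # Generic type: assumed to leave format detection to Tika.
            "Content-Type": "application/octet-stream",
        },
    )

def extract(path: str, endpoint: str = "tika", accept: str = "text/plain") -> str:
    """Upload the file at `path` and return the response body as text."""
    with open(path, "rb") as fh:
        request = build_tika_put(fh.read(), endpoint=endpoint, accept=accept)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With a running container, extract("test.txt") should return the extracted text, while extract("test.txt", endpoint="meta", accept="application/json") requests the metadata as JSON.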

Using a Script to Verify Apache Tika​

If you want to automate the verification process, the following script sends a file to Apache Tika and checks the HTTP status code of the response. If the status code is 200, the script prints a success message along with the response from Apache Tika; otherwise, it prints an error message, the status code, and the response.

import requests

def verify_tika(file_path, tika_url):
    """Send a file to Apache Tika and report whether extraction succeeded."""
    try:
        # Send the raw file bytes as the request body, as `curl -T` does.
        with open(file_path, 'rb') as f:
            response = requests.put(
                tika_url,
                data=f,
                headers={'Accept': 'text/plain'},  # ask for plain-text output
                timeout=60,
            )

        if response.status_code == 200:
            print("Apache Tika successfully analyzed the file.")
            print("Response from Apache Tika:")
            print(response.text)
        else:
            print("Error analyzing the file:")
            print(f"Status code: {response.status_code}")
            print(f"Response from Apache Tika: {response.text}")
    except Exception as e:
        # Covers a missing file as well as connection problems.
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    file_path = "test.txt"  # Replace with the path to your file
    tika_url = "http://localhost:9998/tika"

    verify_tika(file_path, tika_url)

Instructions to run the script:

Prerequisites​

  • Python 3.x must be installed on your system
  • requests library must be installed (you can install it using pip: pip install requests)
  • Apache Tika Docker container must be running (use docker run -p 9998:9998 apache/tika command)
  • Replace "test.txt" with the path to the file you want to send to Apache Tika

Running the Script​

  1. Save the script as verify_tika.py (e.g., using a text editor like Notepad or Sublime Text)
  2. Open a terminal or command prompt
  3. Navigate to the directory where you saved the script (using the cd command)
  4. Run the script using the following command: python verify_tika.py
  5. The script will output a message indicating whether Apache Tika is working correctly

Note: If you encounter any issues, ensure that the Apache Tika container is running correctly and that the file is being sent to the correct URL.
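One common cause of a failed check is probing the server before the container has finished starting. A minimal sketch of a retry loop is shown below; wait_until_ready is a hypothetical helper, and the probe it calls is any function you supply that returns True once Tika answers.

```python
import time

def wait_until_ready(probe, attempts: int = 10, delay: float = 1.0) -> bool:
    """Call `probe()` until it returns True or `attempts` are used up.

    `probe` is any zero-argument callable returning a bool, for example a
    function that sends GET /tika and checks for HTTP 200.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            time.sleep(delay)  # give the container time to finish starting
    return False
```

The attempt count and delay are arbitrary defaults; tune them to how long the container takes to start on your machine.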

Conclusion​

By following these steps, you can verify that Apache Tika is working correctly in a Docker environment. You can test the setup by verifying that the server is running with a GET request, sending a file for analysis, or using a script to automate the process.

Troubleshooting​

  • Make sure the Apache Tika service is running and accessible from the Open WebUI instance.
  • Check the Docker logs for any errors or issues related to the Apache Tika service.
  • Verify that the content extraction engine URL is correctly configured in Open WebUI.
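To narrow down the first item in the list above, a plain TCP check can tell you whether anything is listening on the Tika port at all. This is a minimal sketch with a hypothetical helper name; it says nothing about whether the listener is actually Tika.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timeout
        return False
```

From the Docker host, port_is_open("localhost", 9998) tests the published port. From inside the Open WebUI container, use the host name tika instead, which resolves only when both containers share a Docker network.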

Benefits of Integration​

Integrating Apache Tika with Open WebUI provides several benefits, including:

  • Improved Metadata Extraction: Apache Tika's advanced metadata extraction capabilities can help you extract accurate and relevant data from your files.
  • Support for Multiple File Formats: Apache Tika supports a wide range of file formats, making it an ideal solution for organizations that work with diverse file types.
  • Enhanced Content Analysis: Apache Tika's advanced content analysis capabilities can help you extract valuable insights from your files.

Conclusion​

Integrating Apache Tika with Open WebUI is a straightforward process that can improve the content and metadata extraction capabilities of your Open WebUI instance. By following the steps outlined in this documentation, you can set up Apache Tika as the content extraction engine for Open WebUI.