Hello everyone! If you've been following my latest adventures, you know I've been immersed in the exciting world of Artificial Intelligence, experimenting with Docker Model Runner and, more recently, creating my own AI provider for Drupal (ai_provider_docker). Well, my next challenge to strengthen that module led me to a concept that's fundamental to modern AI: Embeddings!
To implement embeddings properly in my Drupal provider, I needed to dig into a very particular type of database: Vector Databases. Honestly, they sounded a bit intimidating at first, but I've discovered they are incredibly powerful tools for handling AI data.
What is a Vector Database (very briefly)?
Imagine you have millions of documents, images, or audio files. A normal database searches for exact keywords. A vector database, on the other hand, can capture the "meaning" or "context" of that data because it stores each item as a "vector" (essentially, a long list of numbers that represents its characteristics). This allows for much smarter searches, like "find me documents that talk about a similar topic to this one," even if they don't use the exact same words. It's like going from searching by exact matches to searching by "similar ideas."
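To make that "similar ideas" part concrete, here is a tiny Python sketch of how the similarity between two embedding vectors is usually measured (cosine similarity). It isn't tied to Milvus or Drupal, and the vectors and their values are made up for the example; real models produce hundreds or thousands of dimensions.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine similarity between two vectors (closer to 1.0 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" of three imaginary documents.
doc_about_cats = [0.9, 0.1, 0.0, 0.2]
doc_about_kittens = [0.8, 0.2, 0.1, 0.3]
doc_about_taxes = [0.0, 0.9, 0.7, 0.1]

print(cosine_similarity(doc_about_cats, doc_about_kittens))  # high score: related topics
print(cosine_similarity(doc_about_cats, doc_about_taxes))    # low score: unrelated topics

A vector database does essentially this comparison, but against millions of stored vectors and with clever indexes so it doesn't have to compare them one by one.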
Milvus: My Choice to Start
In my exploration exercise, I decided on Milvus (milvusdb/milvus). Why Milvus? It seemed like a robust and flexible option for starting to understand how these databases work in practice. Milvus doesn't work alone: a complete deployment relies on other components, such as Etcd (for metadata and service coordination), MinIO (for object storage of the vector data and index files), and tools like Attu from Zilliz (a web interface to manage Milvus). All of this sounds complex, but the magic is that these pieces work together to handle millions of vectors efficiently.
To get my local Milvus instance up and running for testing, I use a docker-compose.yml file. I'm still one of those who prefers to configure my development environments by hand and build my own versions, but if you're using tools like DDEV, Lando, or others, that's perfectly fine; they often come with similar setups pre-configured.
Here's a simplified docker-compose.yml I use for my local Milvus setup:
services:
  # Drupal container
  drupal:
    build:
      context: ./docker/php
    volumes:
      - ./app:/var/www/app:delegated
    working_dir: /var/www/app
    healthcheck:
      test: ["CMD", "php-fpm", "-t"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - backend

  # Nginx container
  nginx:
    image: nginx:latest
    ports:
      - 80:80
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/conf.d/default.conf
      - ./app:/var/www/app:delegated
    depends_on:
      - drupal
    networks:
      - backend

  # MariaDB container
  mariadb:
    image: mariadb:10.11
    ports:
      - 3306:3306
    restart: always
    command:
      - --disable-log-bin
      - --innodb-buffer-pool-size=256M
      - --max-connections=200
    stop_grace_period: 30s
    environment:
      MYSQL_DATABASE: ${MYSQL_DATABASE}
      MYSQL_USER: ${MYSQL_USER}
      MYSQL_PASSWORD: ${MYSQL_PASSWORD}
      MYSQL_ALLOW_EMPTY_PASSWORD: 1
      MYSQL_TRANSACTION_ISOLATION: READ-COMMITTED
    volumes:
      - mariadb-data:/var/lib/mysql:delegated
    networks:
      - backend

  # Key-value store
  etcd:
    image: quay.io/coreos/etcd:v3.5.0
    container_name: etcd
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
    volumes:
      - ./etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    networks:
      - backend

  # Object storage
  minio:
    image: minio/minio:RELEASE.2020-12-03T00-03-10Z
    container_name: minio
    environment:
      - MINIO_ACCESS_KEY=minioadmin
      - MINIO_SECRET_KEY=minioadmin
    volumes:
      - ./minio-data:/minio_data
    command: minio server /minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    networks:
      - backend
    ports:
      - "9000:9000"

  # Vector database
  milvus:
    image: milvusdb/milvus:v2.4.8
    container_name: milvus
    command: ["milvus", "run", "standalone"]
    environment:
      - ETCD_ENDPOINTS=etcd:2379
      - MINIO_ADDRESS=minio:9000
    ports:
      - "19530:19530"
      - "19121:19121"
    volumes:
      - ./milvus-data:/var/lib/milvus/db
    depends_on:
      - etcd
      - minio
    networks:
      - backend

  # Milvus UI
  attu:
    container_name: attu
    image: zilliz/attu:v2.5.11
    environment:
      MILVUS_URL: milvus:19530
    ports:
      - "3000:3000"
    depends_on:
      - milvus
    networks:
      - backend

volumes:
  mariadb-data:

networks:
  backend:

To get this running, just save it as docker-compose.yml in a directory and run docker compose up -d in your terminal. This spins up all the necessary components!
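Once the containers are up, a quick way to confirm that Milvus is actually reachable on port 19530 is a couple of lines of Python with pymilvus, the official Milvus SDK. This is just a sanity check, not part of the Drupal integration, and it assumes you've installed the client with pip install pymilvus:

from pymilvus import MilvusClient

# Connect to the standalone Milvus exposed by the compose file above.
client = MilvusClient(uri="http://localhost:19530")

# A fresh instance should return an empty list; an exception here usually means
# the milvus, etcd, or minio containers are not healthy yet.
print(client.list_collections())

You can also open Attu at http://localhost:3000 to browse collections, indexes, and data through a web UI.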
Integrating Milvus and "Superpowers" into Drupal
My main goal was to see how I could inject data from Drupal into a vector database and, more importantly, understand which specific use cases it could be useful for. I wanted to give our beloved Drupal the ability to perform searches beyond keywords, empowering its search engine with AI "superpowers."
To achieve this integration, I'm using the Drupal Search API AI module (https://www.drupal.org/project/search_api_ai). This module is the perfect starting point because it lets us create a "server" for our vector database and an index (in Search API terms) to send our data from Drupal to Milvus. Additionally, I found a contributed module that handles the connection with Milvus: ai_vdb_provider_milvus (https://www.drupal.org/project/ai_vdb_provider_milvus). This module is the key bridge between Drupal and Milvus!
After a good amount of configuration and testing, I can happily confirm that the embedding implementation in my contributed module (ai_provider_docker) worked! This means I can now take text in Drupal, convert it into an embedding (that numerical vector I mentioned earlier) using a local AI model, and then store that embedding in Milvus for advanced searches.
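Outside of Drupal, the whole pipeline fits in a short Python sketch: embed a piece of text with a local model, store the vector in Milvus, and query by meaning. Treat it as a rough equivalent of what happens behind the scenes; the embeddings endpoint URL, model name, collection name, and sample texts are placeholders I made up for illustration, so adjust them to however your local embedding model is exposed.

import requests
from pymilvus import MilvusClient

# Placeholders: point these at whatever OpenAI-compatible embeddings API your
# local model exposes; both values are made up for this example.
EMBEDDINGS_URL = "http://localhost:12434/engines/v1/embeddings"
MODEL_NAME = "ai/mxbai-embed-large"

def embed(text: str) -> list[float]:
    """Convert a piece of text into an embedding vector using the local model."""
    response = requests.post(
        EMBEDDINGS_URL,
        json={"model": MODEL_NAME, "input": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"][0]["embedding"]

client = MilvusClient(uri="http://localhost:19530")

# Store one piece of "Drupal content" (hypothetical text) together with its vector.
body = "Drupal is a flexible open-source CMS with a powerful entity system."
vector = embed(body)
client.create_collection(collection_name="drupal_nodes", dimension=len(vector))
client.insert(
    collection_name="drupal_nodes",
    data=[{"id": 1, "vector": vector, "body": body}],
)

# Semantic search: embed the question and look for the closest stored content,
# even though it shares almost no keywords with the stored text.
hits = client.search(
    collection_name="drupal_nodes",
    data=[embed("Which CMS is good for structured content?")],
    limit=3,
    output_fields=["body"],
)
print(hits)

Inside Drupal, the modules mentioned above handle the equivalent steps for me whenever content is indexed, using the embedding provider configured in ai_provider_docker.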
What's Next...
This is just a first glimpse of what I'm doing with vector databases. My next goal is to dedicate a full article to detailing how to build a complete semantic search engine in Drupal using Search API AI and, of course, leveraging this new combination of embeddings and Milvus.
I hope this small introduction to vector databases and Milvus has been as exciting for you as it has been for me! It's a field with enormous potential to improve how we interact with information on our websites. We keep learning!