Update a Typesense collection's embeddings with Python

OpenAI recently announced new embeddings models at lower prices. This is significant: the last widely used model was released back in December 2022, which is a long time in the world of AI.

There is a downside, though, at least in my experience: the new models are not compatible with the old ones. Embeddings produced by different models cannot be meaningfully compared with each other, so if you want to use a new model you need to recalculate every embedding you have already stored.

To handle this, I wrote a simple Python script that refreshes the embeddings in my Typesense collections.

import os
import argparse
import typesense
from openai import OpenAI
import requests
from dotenv import load_dotenv
import json

# Load environment variables
load_dotenv()

# Parse command line arguments
parser = argparse.ArgumentParser(description='Update embeddings field in Typesense documents with OpenAI embeddings.')
parser.add_argument('--typesense-api-key', required=True, help='Typesense API key')
parser.add_argument('--typesense-host', required=True, help='Typesense host')
parser.add_argument('--typesense-port', required=True, help='Typesense port', type=int)
parser.add_argument('--typesense-protocol', default='http', choices=['http', 'https'], help='Typesense protocol')
parser.add_argument('--collection-name', required=True, help='Typesense collection name')
parser.add_argument('--field-name', default='vec', help='Embeddings field name')
parser.add_argument('--content-name', default='content', help='Content field name to generate embeddings')
parser.add_argument('--openai-api-key', default=os.environ.get('OPENAI_API_KEY'), help='OpenAI API key (can also be set via OPENAI_API_KEY environment variable)')
parser.add_argument('--verbosity', action='store_true', help='Enable verbose output')
parser.add_argument('--dry-run', action='store_true', help='Run script in dry run mode without actual updates')
args = parser.parse_args()

# Configure OpenAI
openaiClient = OpenAI(api_key=args.openai_api_key)

# Configure Typesense client
client = typesense.Client({
    'nodes': [{
        'host': args.typesense_host,
        'port': args.typesense_port,
        'protocol': args.typesense_protocol
    }],
    'api_key': args.typesense_api_key,
    'connection_timeout_seconds': 2
})


def fetch_embeddings(text):
    try:
        response = openaiClient.embeddings.create(input=[text], model="text-embedding-3-large", dimensions=1536)
        return response.data[0].embedding
    except Exception as e:
        print(f'Failed to fetch embeddings for: {text}')
        print(e)
        return None


def update_document(collection_name, document_id, embedding):
    if args.dry_run:
        print(f'Dry run: Would update {document_id} with embedding: {embedding[:5]}...')
    else:
        update_response = client.collections[collection_name].documents[document_id].update({
            args.field_name: embedding
        })
        if args.verbosity:
            print(f'Updated {document_id}: {update_response}')


def main():
    export_url = f"{args.typesense_protocol}://{args.typesense_host}:{args.typesense_port}/collections/{args.collection_name}/documents/export"
    headers = {"X-TYPESENSE-API-KEY": args.typesense_api_key}

    response = requests.get(export_url, headers=headers, stream=True)
    response.raise_for_status()  # Stop early if the export request failed
    for line in response.iter_lines():
        if not line:  # filter out keep-alive new lines
            continue
        document = json.loads(line.decode('utf-8'))  # Each exported line is one JSON document
        if args.field_name not in document:
            continue  # Only refresh documents that already have an embeddings field
        text = document.get(args.content_name, '')
        if not text:
            continue  # Skip documents with no content to embed
        embedding = fetch_embeddings(text)
        if embedding:
            update_document(args.collection_name, document['id'], embedding)


if __name__ == '__main__':
    main()
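
One detail in the script is easy to miss: it asks text-embedding-3-large for dimensions=1536. That model returns 3072 dimensions by default, so the parameter presumably keeps the new vectors the same length as the older 1536-dimension ones, which is what the num_dim of an existing Typesense vector field would expect. Below is a minimal sketch of a pre-flight check. It assumes the schema dictionary has the shape returned by client.collections[name].retrieve(); the helper name check_embedding_dim and the example schema are hypothetical and not part of the script.

```python
def check_embedding_dim(schema, field_name, embedding):
    """Return True if the embedding length matches the field's num_dim.

    `schema` is assumed to look like a Typesense collection schema:
    {"name": ..., "fields": [{"name": "vec", "type": "float[]", "num_dim": 1536}, ...]}
    """
    for field in schema.get("fields", []):
        if field.get("name") == field_name:
            expected = field.get("num_dim")
            # A field without num_dim is not a vector field, so nothing can match
            return expected is not None and len(embedding) == expected
    return False  # Field not found in the schema


# Hypothetical schema, for illustration only
example_schema = {
    "name": "documents",
    "fields": [
        {"name": "content", "type": "string"},
        {"name": "vec", "type": "float[]", "num_dim": 1536},
    ],
}

print(check_embedding_dim(example_schema, "vec", [0.0] * 1536))  # True
print(check_embedding_dim(example_schema, "vec", [0.0] * 3072))  # False
```

Running a check like this on the first embedding before the loop starts would catch a dimension mismatch before any document is updated.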

To use the script, install the required Python packages (typesense, openai, requests and python-dotenv) with pip, save the script as update_embeddings.py, and run it with the following command:

python3 update_embeddings.py \
  --typesense-api-key TS_API_KEY \
  --typesense-host TS_HOST \
  --typesense-port TS_PORT \
  --typesense-protocol https \
  --collection-name documents \
  --openai-api-key OPENAI_API_KEY

I added a --dry-run flag so you can test the script without making any actual updates. This is particularly useful when you are working with a large dataset and want to confirm that everything works as expected. Note that a dry run still calls the OpenAI API to compute the embeddings; only the Typesense updates are skipped.
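
With a large dataset, sending one embeddings request per document can be slow. The OpenAI embeddings endpoint accepts a list of inputs, so one option is to embed documents in batches. The sketch below is not part of the script above: batched and build_updates are hypothetical helpers, and embed_batch stands in for a function that would wrap openaiClient.embeddings.create(input=texts, ...) and return one vector per text, in input order.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def build_updates(documents, embed_batch, content_name="content",
                  field_name="vec", batch_size=100):
    """Yield partial documents ({'id', field_name}) ready for an update.

    `embed_batch` is an assumed callable: list of texts -> list of vectors,
    returned in the same order as the inputs.
    """
    # Skip documents that have no text to embed
    with_text = (d for d in documents if d.get(content_name))
    for chunk in batched(with_text, batch_size):
        vectors = embed_batch([d[content_name] for d in chunk])
        for doc, vector in zip(chunk, vectors):
            yield {"id": doc["id"], field_name: vector}


# Stand-in embedder for illustration: the "vector" is just the text length
def fake_embed(texts):
    return [[float(len(t))] for t in texts]


docs = [
    {"id": "1", "content": "hello"},
    {"id": "2", "content": ""},
    {"id": "3", "content": "typesense"},
]
updates = list(build_updates(docs, fake_embed, batch_size=2))
print(updates)  # [{'id': '1', 'vec': [5.0]}, {'id': '3', 'vec': [9.0]}]
```

Each yielded dictionary could then be written with the same documents[document_id].update(...) call the script already uses.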

Typesense now has built-in support for OpenAI embeddings. However, I prefer to keep the two technologies separate, because it makes scaling easier to handle.
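
For comparison, the built-in approach is configured in the collection schema rather than in a separate script. Here is a rough sketch of what such a schema can look like, following the "embed" field format described in the Typesense documentation. The collection name, field names and API key are placeholders, and the model name is an assumption that depends on your Typesense version, so check the docs before relying on it.

```python
import json

# Sketch of a schema with an auto-embedding field. All names are placeholders;
# verify the "embed" options against the docs for your Typesense version.
auto_embedding_schema = {
    "name": "documents",
    "fields": [
        {"name": "content", "type": "string"},
        {
            "name": "vec",
            "type": "float[]",
            "embed": {
                "from": ["content"],  # source field(s) that Typesense embeds
                "model_config": {
                    # Assumed model name; support varies by Typesense version
                    "model_name": "openai/text-embedding-3-large",
                    "api_key": "OPENAI_API_KEY",  # placeholder, not a real key
                },
            },
        },
    ],
}

print(json.dumps(auto_embedding_schema, indent=2))
```

A schema like this would be passed to client.collections.create(...), after which Typesense generates the embeddings itself whenever documents are indexed.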