SingleStore VectorStore
A high-performance vector database library for storing and querying vector embeddings in SingleStore DB. Designed to efficiently manage and search through high-dimensional vector data for AI/ML applications, semantic search, and recommendation systems.
Table of Contents
- Installation
- Overview
- Getting Started
- Connecting to SingleStore
- Creating and Managing Indexes
- Working with Vectors
- Querying Vectors
- Advanced Features
- API Reference
- Best Practices
Installation
Install the package using pip:
pip install singlestore-vectorstore
Overview
SingleStore VectorStore is a Python library that provides:
- Simple API for vector similarity search
- Efficient indexing for high-dimensional vectors
- Support for multiple distance metrics (Cosine, Dot Product, Euclidean)
- Metadata filtering capabilities
- Connection pooling for performance
- Namespace support for organizing vectors
Getting Started
Basic Usage
from vectorstore import VectorDB, Metric, Vector
# Initialize the VectorDB
db = VectorDB(
host="localhost",
user="root",
password="password",
database="embeddings_db"
)
# Create an index
db.create_index(
name="my_embeddings",
dimension=1536, # e.g., for OpenAI embeddings
metric=Metric.COSINE,
)
# Get a reference to the index
index = db.Index("my_embeddings")
# Add vectors to the index
vectors = [
Vector(id="doc1", vector=[0.1, 0.2, 0.3, ...], metadata={"source": "article"}),
Vector(id="doc2", vector=[0.2, 0.3, 0.4, ...], metadata={"source": "webpage"})
]
index.upsert(vectors)
# Find similar vectors
results = index.query(
vector=[0.15, 0.25, 0.35, ...],
top_k=5,
include_metadata=True
)
# Print results
for match in results:
print(f"ID: {match['id']}, Score: {match['score']}, Metadata: {match['metadata']}")
Connecting to SingleStore
Connection Options
There are several ways to connect to SingleStore DB:
1. Direct Connection Parameters
Connection parameters can be passed individually:
from vectorstore import VectorDB
db = VectorDB(
host="localhost",
port=3306,
user="root",
password="password",
database="vectors"
)
Or as a connection URL:
from vectorstore import VectorDB
db = VectorDB(
host="root:password@localhost:3306/vectors"
)
Or as environment variables:
import os

os.environ['SingleStore_URL'] = 'user:password@localhost:3306/my_db'
db = VectorDB()
VectorDB supports all connection options accepted by the underlying singlestoredb Python client.
2. Existing Connection
from singlestoredb import connect
from vectorstore import VectorDB
# Create a connection
connection = connect(
host="localhost",
user="root",
password="password",
database="vectors"
)
# Use the existing connection
db = VectorDB(connection=connection)
3. Connection Pool (Recommended for Production)
from sqlalchemy.pool import QueuePool
from singlestoredb import connect
from vectorstore import VectorDB
# Create a connection pool
def create_connection():
return connect(
host="localhost",
user="root",
password="password",
database="vectors"
)
connection_pool = QueuePool(
creator=create_connection,
pool_size=10,
max_overflow=20,
timeout=30
)
# Use the connection pool
db = VectorDB(connection_pool=connection_pool)
Creating and Managing Indexes
Creating an Index
from vectorstore import VectorDB, Metric, DeletionProtection
db = VectorDB(host="localhost", user="root", password="password", database="vectors")
# Create a simple index
basic_index = db.create_index(
name="basic_index",
dimension=1536,
)
# Create a more customized index
custom_index = db.create_index(
name="custom_index",
dimension=768,
metric=Metric.EUCLIDEAN,
deletion_protection=DeletionProtection.ENABLED,
tags={"model": "sentence-transformers", "version": "v1.0"},
use_vector_index=True,
vector_index_options={
"index_type": "IVF_PQFS",
"nlist": 1024,
"nprobe": 20
}
)
Vector Index Options
When creating an index with use_vector_index=True, you can configure various index types and parameters to optimize for your specific use case. SingleStore supports several vector index types, each with different performance characteristics:
vector_index_options={
"index_type": "IVF_FLAT", # Specify the index type
"nlist": 1024, # Number of clusters/centroids
"nprobe": 20, # Number of clusters to search during query time
# Additional parameters specific to each index type...
}
Supported Index Types
- FLAT
  - Brute-force approach that compares against every vector
  - Highest accuracy but slowest for large datasets
  - No additional parameters required
  - Best for: Small datasets or when accuracy is critical
- IVF_FLAT (Inverted File with Flat Quantizer)
  - Uses clustering to accelerate searches
  - Good balance of quality and performance
  - Parameters:
    - nlist: Number of centroids/clusters (default: 100; higher values improve accuracy but slow down indexing)
    - nprobe: Number of clusters to search at query time (default: 1; higher values improve accuracy but slow down search)
  - Best for: Medium-sized datasets with moderate query performance requirements
- IVF_SQ (Inverted File with Scalar Quantization)
  - Compresses vectors to reduce memory usage
  - Parameters:
    - nlist, nprobe: Same as IVF_FLAT
    - qtype: Quantizer type, either "QT8" (8-bit) or "QT4" (4-bit)
  - Best for: Large datasets where memory usage is a concern
- IVF_PQ (Inverted File with Product Quantization)
  - Advanced compression technique that divides vectors into subvectors
  - Parameters:
    - nlist, nprobe: Same as IVF_FLAT
    - m: Number of subvectors (default: dimension / 2)
    - nbits: Bits per subvector (default: 8)
  - Best for: Very large datasets where memory usage is critical
- IVF_PQFS (Inverted File with PQ Fast Scan)
  - Optimized version of IVF_PQ with SIMD acceleration
  - Parameters:
    - nlist, nprobe: Same as IVF_FLAT
    - m: Number of subvectors (must be a multiple of 4)
    - nbits: Bits per subvector (must be 8)
  - Best for: Production systems with large datasets and high query throughput
- HNSW (Hierarchical Navigable Small World)
  - Graph-based approach that builds a navigation network between vectors
  - Very fast queries but slower index building
  - Parameters:
    - M: Number of edges per node (default: 12)
    - efConstruction: Size of the dynamic candidate list during construction (default: 40)
    - ef: Size of the dynamic candidate list during search (default: 10)
    - random_seed: Random seed for reproducibility (default: current time)
  - Best for: Applications requiring extremely fast search on moderate-sized datasets
Parameter Tuning Guidelines
- Increasing nlist: Improves search speed but requires more memory and longer index build time
- Increasing nprobe: Improves accuracy but slows down searches
- For IVF_PQ/PQFS:
  - Lower m values: Faster search but lower accuracy
  - Higher m values: Better accuracy but slower search
- For HNSW:
  - Higher M values: Better accuracy but larger index size and longer build time
  - Higher ef values: Better accuracy but slower search
For complete details on vector indexing options, see the SingleStore Vector Indexing documentation.
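As a rough illustration of these tradeoffs, the sketch below contrasts a recall-leaning and a latency-leaning configuration. The index names are hypothetical and the parameter values are illustrative starting points, not recommendations:
# Accuracy-leaning configuration: more clusters are probed at query time,
# so recall is higher at the cost of slower searches
db.create_index(
    name="recall_first",
    dimension=768,
    metric=Metric.COSINE,
    use_vector_index=True,
    vector_index_options={
        "index_type": "IVF_FLAT",
        "nlist": 1024,
        "nprobe": 64,
    },
)

# Latency-leaning configuration: PQ Fast Scan compresses vectors and probes
# few clusters, trading some accuracy for throughput and memory savings
db.create_index(
    name="latency_first",
    dimension=768,
    metric=Metric.COSINE,
    use_vector_index=True,
    vector_index_options={
        "index_type": "IVF_PQFS",
        "nlist": 1024,
        "nprobe": 8,
        "m": 96,     # must be a multiple of 4 for IVF_PQFS
        "nbits": 8,  # must be 8 for IVF_PQFS
    },
)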
Listing Indexes
# Get all indexes
indexes = db.list_indexes()
# Print index details
for idx in indexes:
print(f"Index: {idx.name}, Dimension: {idx.dimension}, Metric: {idx.metric}")
Describing an Index
# Get detailed information about an index
index_info = db.describe_index("my_index")
print(f"Name: {index_info.name}")
print(f"Dimension: {index_info.dimension}")
print(f"Metric: {index_info.metric}")
print(f"Deletion Protection: {index_info.deletion_protection}")
print(f"Tags: {index_info.tags}")
print(f"Uses Vector Index: {index_info.use_vector_index}")
print(f"Vector Index Options: {index_info.vector_index_options}")
Configuring an Index
# Update index settings
db.configure_index(
name="my_index",
deletion_protection=DeletionProtection.ENABLED,
tags={"updated": "true", "version": "v2.0"},
use_vector_index=True,
vector_index_options={
"index_type": "IVF_FLAT",
"nlist": 2048
}
)
Checking If an Index Exists
if db.has_index("my_index"):
print("Index exists")
else:
print("Index doesn't exist")
Deleting an Index
# Delete an index
db.delete_index("my_index")
# This will fail if deletion protection is enabled
try:
db.delete_index("protected_index")
except ValueError as e:
print(f"Could not delete: {e}")
Working with Vectors
Different Ways to Represent Vectors
from vectorstore import Vector
# Method 1: Using Vector class
vectors = [
Vector(id="vec1", vector=[0.1, 0.2, 0.3], metadata={"category": "A"}),
Vector(id="vec2", vector=[0.4, 0.5, 0.6], metadata={"category": "B"})
]
# Method 2: Using tuples (id, values)
vectors_tuples = [
("vec3", [0.7, 0.8, 0.9]),
("vec4", [0.10, 0.11, 0.12])
]
# Method 3: Using tuples with metadata (id, values, metadata)
vectors_with_meta = [
("vec5", [0.13, 0.14, 0.15], {"category": "C"}),
("vec6", [0.16, 0.17, 0.18], {"category": "D"})
]
# Method 4: Using dictionaries
vectors_dict = [
{"id": "vec7", "values": [0.19, 0.20, 0.21], "metadata": {"category": "E"}},
{"id": "vec8", "values": [0.22, 0.23, 0.24], "metadata": {"category": "F"}}
]
Inserting Vectors
# Get index reference
index = db.Index("my_index")
# Insert vectors
count = index.upsert(vectors)
print(f"Inserted {count} vectors")
# Insert with namespace
index.upsert(vectors_tuples, namespace="group1")
index.upsert(vectors_with_meta, namespace="group2")
Using Pandas DataFrames
import pandas as pd
# Create a DataFrame with vector data
df = pd.DataFrame([
{"id": "vec1", "values": [0.1, 0.2, 0.3], "metadata": {"category": "A"}},
{"id": "vec2", "values": [0.4, 0.5, 0.6], "metadata": {"category": "B"}}
])
# Upsert from DataFrame
count = index.upsert_from_dataframe(df, namespace="pandas_import")
print(f"Imported {count} vectors from DataFrame")
Updating Vectors
# Update vector values
index.update(
id="vec1",
values=[0.25, 0.35, 0.45]
)
# Update metadata only
index.update(
id="vec2",
set_metadata={"category": "updated", "version": 2}
)
# Update both values and metadata with namespace
index.update(
id="vec3",
values=[0.55, 0.65, 0.75],
set_metadata={"processed": True},
namespace="group1"
)
Fetching Vectors
# Get vectors by ID
vectors = index.fetch(
ids=["vec1", "vec2", "vec3"]
)
# Get vectors by ID with namespace
vectors_in_namespace = index.fetch(
ids=["vec3", "vec4"],
namespace="group1"
)
# Access vector data
for vec_id, vec_obj in vectors.items():
print(f"ID: {vec_id}")
print(f"Vector: {vec_obj.vector[:5]}...") # Print first 5 elements
print(f"Metadata: {vec_obj.metadata}")
Deleting Vectors
# Delete vectors by ID
index.delete(ids=["vec1", "vec2"])
# Delete vectors by ID in a namespace
index.delete(ids=["vec3", "vec4"], namespace="group1")
# Delete all vectors in a namespace
index.delete(delete_all=True, namespace="group2")
# Delete vectors matching a filter
index.delete(
filter={"category": "A"},
namespace="pandas_import"
)
Listing Vector IDs
# List all vector IDs
ids = index.list()
# List vectors with a prefix
ids_with_prefix = index.list(prefix="doc_")
# List vectors in a namespace
ids_in_namespace = index.list(namespace="group1")
Getting Index Statistics
# Get statistics about the index
stats = index.describe_index_stats()
print(f"Dimension: {stats['dimension']}")
print(f"Total Vector Count: {stats['total_vector_count']}")
# Namespace statistics
for ns_name, ns_stats in stats['namespaces'].items():
print(f"Namespace: {ns_name}, Vectors: {ns_stats['vector_count']}")
# Get filtered statistics
filtered_stats = index.describe_index_stats(
filter={"category": "A"}
)
Querying Vectors
Basic Query
# Query by vector values
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=5
)
# Print results
for match in results:
print(f"ID: {match['id']}, Score: {match['score']}")
Query Options
# Query with metadata and vector values in response
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
include_metadata=True,
include_values=True
)
# Query by existing vector ID
results = index.query(
id="vec1", # Use this vector's values for the query
top_k=5
)
# Query within a namespace
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
namespace="group1",
top_k=5
)
# Query across multiple namespaces
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
namespaces=["group1", "group2"],
top_k=5
)
Query with Filtering
# Simple equality filter
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
filter={"category": "A"}
)
# Comparison operators
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
filter={"year": {"$gt": 2020}}
)
# Multiple conditions with AND
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
filter={
"$and": [
{"category": "article"},
{"year": {"$gte": 2020}}
]
}
)
# Multiple conditions with OR
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
filter={
"$or": [
{"category": "article"},
{"category": "blog"}
]
}
)
# Check if field exists
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
filter={"author": {"$exists": True}}
)
# Collection operators
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
filter={"category": {"$in": ["article", "blog", "news"]}}
)
Vector Index Control
Vector indexes significantly accelerate similarity searches, especially with large datasets, but there's always a tradeoff between search speed and accuracy. Higher accuracy settings typically result in slower searches, while faster searches may return slightly less optimal results.
# Disable vector index for this query
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
disable_vector_index_use=True # Force brute-force search for maximum accuracy
)
# Customize search options based on index type
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
search_options={
# Parameters vary by index type
"nprobe": 50, # For IVF-based indexes
"ef": 100 # For HNSW indexes
}
)
Search Parameters by Index Type
Each vector index type supports different search-time parameters that control the speed vs. accuracy tradeoff:
ALL TYPES
search_options={
    "k": 50  # Number of rows output by the vector index scan; k must be >= top_k
}
- FLAT
  - No tunable search parameters (always performs exhaustive search)
  - Always returns exact results with highest accuracy
- IVF_FLAT, IVF_SQ, IVF_PQ, IVF_PQFS
  search_options={
      "nprobe": 20  # Number of clusters to search (higher = more accurate, but slower)
                    # Default is 1; common range: 5-100 depending on dataset size
  }
- HNSW
  search_options={
      "ef": 40  # Size of dynamic candidate list (higher = more accurate, but slower)
                # Default is 10; common range: 20-200 depending on dataset size
  }
Tuning Tips
- Start with default values and increase gradually until you find the right balance
- For high recall requirements, use higher parameter values (higher nprobe or ef)
- For time-sensitive applications, use lower values
- Performance measurement example:
import time

# Measure search time vs. accuracy tradeoff
for nprobe in [1, 10, 50, 100]:
    start = time.time()
    results = index.query(
        vector=query_vector,
        top_k=10,
        search_options={"nprobe": nprobe}
    )
    end = time.time()
    print(f"nprobe={nprobe}, time={end-start:.4f}s")
    # Compare results with ground truth if available
For more details on vector index parameters, refer to the SingleStore Vector Indexing documentation.
Advanced Features
Working with Different Distance Metrics
# Create indexes with different metrics
cosine_index = db.create_index(
name="cosine_index",
dimension=1536,
metric=Metric.COSINE # Normalized dot product, best for comparing directions
)
dotproduct_index = db.create_index(
name="dotproduct_index",
dimension=1536,
metric=Metric.DOTPRODUCT # Raw dot product, good for comparing direction and magnitude
)
euclidean_index = db.create_index(
name="euclidean_index",
dimension=1536,
metric=Metric.EUCLIDEAN # Euclidean distance, good for spatial data
)
Filter Types
from vectorstore import (
FilterTypedDict, # Base filter type
AndFilter, # $and logical operator
OrFilter, # $or logical operator
SimpleFilter, # Direct field matching
ExactMatchFilter, # Exact field value matching
EqFilter, # $eq comparison
NeFilter, # $ne comparison
GtFilter, # $gt comparison
GteFilter, # $gte comparison
LtFilter, # $lt comparison
LteFilter, # $lte comparison
InFilter, # $in collection operator
NinFilter # $nin collection operator
)
# Complex filter example
complex_filter: FilterTypedDict = {
"$and": [
{
"$or": [
{"category": "article"},
{"category": "blog"}
]
},
{"year": {"$gte": 2020}},
{"author": {"$exists": True}}
]
}
results = index.query(
vector=[0.1, 0.2, 0.3, ...],
top_k=10,
filter=complex_filter
)
API Reference
Main Classes
- VectorDB: Main entry point for creating and managing vector indexes
- IndexInterface: Interface for interacting with a specific index
- Vector: Class representing a vector with ID, values, and metadata
- IndexModel: Configuration for an index
Enums
- Metric: Similarity metrics (COSINE, DOTPRODUCT, EUCLIDEAN)
- DeletionProtection: Protection against accidental deletion (ENABLED, DISABLED)
Best Practices
- Connection Management:
  - Use connection pooling for production applications
  - Close connections properly when not using a pool
- Vector Indexing:
  - Enable vector indexes for large datasets (use_vector_index=True)
  - Tune vector_index_options based on dataset size and query patterns
- Namespaces:
  - Use namespaces to organize vectors by source, type, or domain
  - Query across multiple namespaces when relevant
- Batch Operations:
  - Use batch operations for inserting multiple vectors
  - For large datasets, use upsert_from_dataframe with an appropriate batch_size (see the sketch after this list)
- Metrics Selection:
  - Cosine similarity is best for direction comparison (most common)
  - Dot product works well when magnitude matters
  - Euclidean distance is good for spatial data
- Deletion Protection:
  - Enable deletion protection for production indexes
  - Configure indexes properly before adding large amounts of data
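A minimal sketch of the batch-operations advice above. The ids, values, namespace, and batch_size value are illustrative; verify the upsert_from_dataframe signature in your installed version before relying on it:
import pandas as pd

# Build a DataFrame of vectors to import (ids and values are placeholders)
df = pd.DataFrame([
    {"id": f"doc_{i}", "values": [0.0] * 1536, "metadata": {"batch": "demo"}}
    for i in range(10_000)
])

# Upsert in batches rather than row by row; batch_size controls how many
# rows are sent per insert
count = index.upsert_from_dataframe(df, namespace="bulk_import", batch_size=500)
print(f"Imported {count} vectors")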
Metadata Filtering
VectorStore supports powerful metadata filtering capabilities that let you narrow down vector searches based on their associated metadata.
Filter Types
- Simple Equality Filter
  # Find vectors where category is exactly "article"
  filter = {"category": "article"}
- Comparison Operators
  # Equal to
  filter = {"year": {"$eq": 2023}}
  # Not equal to
  filter = {"year": {"$ne": 2023}}
  # Greater than
  filter = {"year": {"$gt": 2020}}
  # Greater than or equal to
  filter = {"year": {"$gte": 2020}}
  # Less than
  filter = {"year": {"$lt": 2023}}
  # Less than or equal to
  filter = {"year": {"$lte": 2023}}
- Collection Operators
  # Value is in a specified array
  filter = {"category": {"$in": ["article", "blog", "news"]}}
  # Value is not in a specified array
  filter = {"category": {"$nin": ["video", "podcast"]}}
- Existence Checks
  # Field exists
  filter = {"author": {"$exists": True}}
  # Field does not exist
  filter = {"author": {"$exists": False}}
- Logical Operators
  # AND - all conditions must match
  filter = {
      "$and": [
          {"category": "article"},
          {"year": {"$gte": 2020}}
      ]
  }
  # OR - at least one condition must match
  filter = {
      "$or": [
          {"category": "article"},
          {"category": "blog"}
      ]
  }
- Combined Complex Filters
  # Articles or blogs from 2020 or later that have an author field
  filter = {
      "$and": [
          {
              "$or": [
                  {"category": "article"},
                  {"category": "blog"}
              ]
          },
          {"year": {"$gte": 2020}},
          {"author": {"$exists": True}}
      ]
  }
How Filtering Works
Metadata filters are translated into SQL expressions that filter results based on the JSON metadata stored with each vector. The filters are applied before distance calculation for SQL-level filtering, improving query efficiency.
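For intuition only, the sketch below shows the idea of SQL-level filtering; the SQL in the comment is an approximation (JSON_EXTRACT_STRING is a SingleStore JSON function), and the expression the library actually generates may differ:
# Conceptual illustration - not the library's actual implementation.
# A filter such as {"category": "article"} becomes a predicate over the JSON
# metadata column, roughly along the lines of:
#
#   WHERE JSON_EXTRACT_STRING(metadata, 'category') = 'article'
#
# so non-matching rows are excluded before any distance computation happens.
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={"category": "article"},  # pushed down into the SQL WHERE clause
)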
Filter Usage
Filters can be used in multiple operations:
-
In queries:
results = index.query( vector=[0.1, 0.2, 0.3, ...], top_k=10, filter={"$and": [{"category": "article"}, {"year": {"$gte": 2020}}]} )
-
For deletion operations:
# Remove outdated vectors index.delete( filter={"status": "outdated"} )
-
For statistical analysis:
# Get statistics for a specific category stats = index.describe_index_stats( filter={"category": "article"} )
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Roadmap
Future development plans include:
- Adding index-for-model support with hybrid search capabilities (combining text and vector embedding searches)