and a "vector database" is a collection of *vectors*, sure. The question comes f...

modernpink · on May 5, 2023

An embedding model will map a string of text (of variable length) to R^D. Each model has its own fixed D, yes. (Typically the vector is a unit vector for performance reasons.) The main function of a vector database is similarity lookup, so you would calculate the approximate nearest neighbours of a vector using a scalar-valued vector distance metric (e.g. cosine similarity). These similarity metrics operate on two vectors of the same dimension.
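To make the lookup concrete, here is a minimal Python sketch. It is brute-force (exact) rather than approximate, and the function names and toy vectors are made up for illustration, not taken from any particular vector database.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; they must share a dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    # After normalizing, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def nearest(query, table, k=1):
    """Exact top-k keys by cosine similarity (a real index would approximate this)."""
    scored = [(cosine_similarity(query, v), key) for key, v in table.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]

# Toy table with D = 3 (hypothetical data).
table = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.9, 0.1, 0.0]}
top2 = nearest([1.0, 0.0, 0.0], table, k=2)
```

Storing unit vectors means the normalization only has to happen once at write time, which is presumably the performance reason mentioned above.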

You would not mix and match embedding models (e.g. with differing dimensions) at look-up time. The target vector table assumes you will look it up with a vector created from the exact same embedding model and version that was used to backfill it.

The API documentation for a lookup operation may be more illuminating here:

>vector (array of floats)

>The query vector. This should be the same length as the dimension of the index being queried. Each query() request can contain only one of the parameters id or vector.

https://docs.pinecone.io/reference/query

seanhunter · on May 5, 2023

D is fixed for the model but not for the database. You don't need a separate database for each model.

modernpink · on May 5, 2023

Yes, that's imprecision on my part. For a table, D is fixed, but not necessarily across tables in the vector database index.
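A rough Python sketch of that distinction, assuming a hypothetical `VectorTable` class (not any real product's API): each table fixes its own D and records which model populated it, while one database holds tables with different D.

```python
class VectorTable:
    """Hypothetical table: one fixed dimension, populated by one embedding model."""

    def __init__(self, dim, model):
        self.dim = dim
        self.model = model  # illustrative label only; nothing enforces it here
        self.rows = {}

    def _check(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def upsert(self, key, vector):
        self._check(vector)
        self.rows[key] = list(vector)

    def query(self, vector):
        self._check(vector)
        # Highest dot product wins; for unit vectors this equals cosine similarity.
        return max(
            self.rows,
            key=lambda k: sum(a * b for a, b in zip(self.rows[k], vector)),
        )

# One database, two tables with different D (model names are placeholders).
database = {
    "small": VectorTable(dim=2, model="model-a-v1"),
    "large": VectorTable(dim=3, model="model-b-v1"),
}
database["small"].upsert("x", [1.0, 0.0])
database["small"].upsert("y", [0.0, 1.0])
database["large"].upsert("z", [0.0, 0.0, 1.0])
```

Note that the length check only catches mismatched dimensions. Two different models with the same D would pass it and silently return meaningless neighbours, so keeping the model and version consistent is still up to the caller.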

seanhunter · on May 5, 2023

As per my comment earlier, the dimension isn't fixed. The usual use case (storing embeddings) is instructive as to the range of values. For token embeddings, the embedding is often generated via a lookup in a fixed vocabulary that maps each token to a token ID. So say your vocab is words: the value is a word ID, which would obviously be an integer, not a real. Here's an intro to word embeddings: https://wiki.pathmind.com/word2vec and here's one for positional embeddings (the new hotness given how zeitgeisty GPTs are): https://theaisummer.com/positional-embeddings/
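A toy Python sketch of that lookup, with an invented three-word vocabulary and random values standing in for learned weights: the token ID is an integer, and that ID then indexes a row of real numbers in the embedding table.

```python
import random

# Toy vocabulary: token -> integer ID (hypothetical, for illustration only).
vocab = {"the": 0, "cat": 1, "sat": 2}

D = 4  # embedding dimension chosen arbitrarily for this sketch
rng = random.Random(0)

# One row of D floats per token ID; a trained model would learn these values.
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in vocab]

def embed(tokens):
    """Map tokens to integer IDs, then IDs to rows of the embedding table."""
    ids = [vocab[t] for t in tokens]
    vectors = [embedding_table[i] for i in ids]
    return ids, vectors

ids, vectors = embed(["the", "cat"])
```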