
The answer was at least "maybe"!


Hi, author here. I totally agree that, at large scale, you're going to need a vector database. My hope is more to help people avoid scenarios like the one in this comment: https://news.ycombinator.com/item?id=35552303. Tangentially, I really like the approach Haystack has taken: they let you slot in whichever document store you want, and that document store can scale from in-memory, to SQLite, to Postgres, to Pinecone: https://docs.haystack.deepset.ai/docs/document_store

In terms of the one-time cost of indexing, you're totally right! One thing to call out, though: you'll have to re-index every time you change your embedding model, such as after fine-tuning. I don't have a good handle on how prevalent that is, though.


Thanks for mentioning Haystack :)


It's a nice project that was ahead of its time! I hope it can successfully ride the current hype wave :D


Hi, author here.

1. You make a great point about longer documents requiring multiple vectors, which I should've mentioned in the post. Depending on your use case, this can certainly explode your dataset size!

2. Good to know about the pgvector limitations -- I haven't used it yet.

3. I guess "index" would be the more database-y term. That said, one thing I'll call out is that you have to re-index if you ever change your embedding model, and indexing can be slow: it took me ~20-30 minutes to index the 10 million embeddings in my benchmark.


I'd be interested to know whether anyone has hard data on the "best" size for the document "fragments" that get embedded into a dense vector.

Obviously, embedding single words probably isn't particularly useful for reassembling portions of a document for submission to an LLM in the prompt. I'm currently pondering what size of string is best for embedding, and considering that a variable size might be one option.

Testing with strings of around 512 characters seems to do pretty well, but storing multiple lengths of similar runs of the document might be a better way to do it.


This will depend on the specific model you're using, because:

- if a model has been trained on shorter paragraphs, it will likely do better on those than on longer ones, and vice versa

- each model has some maximum input length (e.g. 512 tokens, or about 350 words), and might silently discard words when it's given a longer chunk

I don't know whether or not processing multiple lengths is worthwhile, but you probably want to have some overlap when you turn your docs into chunks.
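A minimal word-based chunker with overlap might look like this (the function name and defaults are my own choices for illustration, not from any particular library):

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into word-based chunks, with `overlap` words shared
    between consecutive chunks so content straddling a boundary
    appears intact in at least one chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, consecutive 300-word chunks share 50 words, so a sentence cut off at one chunk's end still appears whole near the start of the next.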

Maybe take a look at LangChain or LlamaGPT: someone has probably come up with sensible defaults for overlap and chunk size.

If you want to do embeddings locally, check out sentence-transformers/all-MiniLM-L6-v2
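Once you have vectors (all-MiniLM-L6-v2 outputs 384-dimensional embeddings), small-scale retrieval is just brute-force cosine similarity. A sketch in plain NumPy, with toy 2-D vectors standing in for real model output:

```python
import numpy as np

def top_k(query, vectors, k=2):
    # Normalize both sides so the dot product equals cosine similarity,
    # then return the indices of the k most similar rows.
    q = query / np.linalg.norm(query)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k].tolist()
```

This is the exact-search baseline that approximate indexes (HNSW, IVF, etc.) trade accuracy against for speed.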


On your last point: I guess recalculating 10 million embeddings takes much longer than the 20-30 mins to re-index?

Or perhaps we care because calculating the embeddings can be done in parallel with no limit, but the indexing is somehow constrained?


Yeah, depending on the model, calculating the 10 million embeddings could take longer sequentially, but, as you mention, it's also an embarrassingly parallel operation. I don't think that indexing can be performed in parallel, but I may be wrong on that one.
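The embarrassingly parallel part can be sketched like this (`fake_embed` is a stand-in for a real model call such as a model's `encode` method; all names here are mine):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_embed(batch):
    # Stand-in for a real model call; returns one "vector"
    # (here just the text length) per input text.
    return [float(len(t)) for t in batch]

def embed_parallel(texts, batch_size=2, workers=4):
    # Each batch is embedded independently of the others, so batches
    # can be fanned out to threads, processes, or separate machines.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fake_embed, batches)  # preserves input order
    return [vec for batch in results for vec in batch]
```

For a CPU-bound model you'd use processes or GPU batching rather than threads, but the structure is the same; the index build, by contrast, typically has to see all the vectors together.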


Dia&Co | Software Engineer, Product Manager, Data Scientist, and Data Analyst | New York, NY | Full-time, ONSITE, REMOTE

Dia&Co is the premier personal styling service for plus-size women. We’re looking for engineers, product, and data people to help create our suite of large consumer-facing and internal products that are transforming both operational efficiency and consumer e-commerce. We work with Ruby on Rails on the engineering side and Python on the data science side.

Please check out our tech blog to get an idea of what we think about and value: https://making.dia.com/

The interview process is a phone screen, a take-home coding challenge, and finally an on-site interview. Apply here, and let us know that you found us on Hacker News: https://www.dia.co/careers



