When building RAG applications, it’s crucial to consider the limitations of the initial embedding. Documents are retrieved by cosine similarity between the question’s vector and the stored document vectors, so the context the model sees is only as good as those embeddings: if either the query vector or the archived vectors represent the text poorly, relevant passages never make it past the similarity cutoff. However, there’s a better way. In my experience, loosening the cosine-similarity filter to bring in more candidate context, then using a reranker (such as the one from Cohere) to pick the best of those candidates, can generate better responses and make us less limited by the initial embedding. A model, after all, should be more powerful than a normalized dot product.
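To make the two-stage idea concrete, here is a minimal sketch in Python, not a production implementation. Every name in it (`retrieve_then_rerank`, `wide_k`, `final_k`, `keyword_overlap`) is hypothetical and chosen for illustration. In particular, `keyword_overlap` is only a toy stand-in for a real reranking model such as Cohere's; it is not that service's API, and in practice you would swap in a call to whichever reranker you use.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat an all-zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def retrieve_then_rerank(
    query: str,
    query_vec: Vector,
    docs: Sequence[str],
    doc_vecs: Sequence[Vector],
    rerank_score: Callable[[str, str], float],
    wide_k: int = 20,
    final_k: int = 3,
) -> List[str]:
    """Stage 1: wide cosine-similarity retrieval. Stage 2: rerank the candidates."""
    # Stage 1: rank every document by embedding similarity and keep a
    # deliberately generous candidate set (wide_k), not just the top few.
    by_similarity = sorted(
        range(len(docs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = [docs[i] for i in by_similarity[:wide_k]]

    # Stage 2: rescore each candidate against the raw query text and keep
    # only the best final_k for the prompt.
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:final_k]


def keyword_overlap(query: str, doc: str) -> float:
    """Toy scorer (assumption): fraction of query words that appear in the doc.

    A stand-in for a real reranking model, used only to keep the sketch runnable.
    """
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


if __name__ == "__main__":
    # Made-up 2-D "embeddings" in which the wrong document is the nearest neighbor.
    docs = ["cats sleep a lot", "how to reset a router", "baking bread at home"]
    doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
    query, query_vec = "reset router", [1.0, 0.2]

    # Narrow retrieval trusts the embedding alone.
    print(retrieve_then_rerank(query, query_vec, docs, doc_vecs, keyword_overlap, wide_k=1, final_k=1))
    # Wider retrieval gives the reranker a chance to recover the relevant document.
    print(retrieve_then_rerank(query, query_vec, docs, doc_vecs, keyword_overlap, wide_k=2, final_k=1))
```

With these made-up vectors, the narrow call returns the cat document because it is the nearest neighbor in embedding space, while the wider call lets the reranker promote the router document. The design point is that `wide_k` controls how much you trust the embedding and `final_k` controls how much context actually reaches the prompt, so the two can be tuned separately.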