I. Introduction to RAG and Precautions
A. Concept of RAG
"Retrieval-Augmented Generation" (RAG) is a method that retrieves relevant information and includes it in LLM prompts, giving the model context it would otherwise lack.
B. Implement RAG with Caution
Before considering RAG, an organization should already have basic data infrastructure in place, such as logs, ETL, analytics, and dashboards. If there is no data warehouse yet, it is probably too early to implement RAG.
II. Steps to Build a RAG Pipeline
A. Determine How to Measure Results
Because a RAG pipeline carries real costs, you should confirm that it actually improves results, so the first thing to decide is how to measure results. This is a difficult problem; Hamel's evaluation article is a useful reference.
B. Test Hypotheses
Establish a baseline: run a set of prompts (preferably from real users) through the system to see how it behaves without retrieval, and keep the results as an evaluation set. Run them multiple times to gauge reliability.
Experiment with prompts: before building anything, manually add the ideal context to a prompt and see whether the response improves. If it does not, there is no need to build a RAG pipeline.
Build and test: once you have a concrete goal, build the RAG pipeline and test whether it improves responses. Synthetic data can be used to establish recall and accuracy metrics.
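As a minimal sketch of one such metric, recall@k measures what fraction of the known-relevant documents appear in the top-k retrieved results. The retrieved IDs and relevance labels below are made-up stand-ins, not data from the article:

```python
# Minimal sketch of recall@k over a synthetic evaluation set.
# The document IDs and relevance labels are illustrative only.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Synthetic example: for one query, the pipeline retrieved these IDs,
# and we know which IDs are actually relevant.
retrieved = ["doc3", "doc1", "doc7", "doc2"]
relevant = {"doc1", "doc2"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only doc1 is in the top 3
```

Averaging this value over many synthetic queries gives a single number to track as the pipeline changes.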
C. Use Text Embeddings to Search for Data
Collect and chunk documents: Collect documents in an easily accessible place and chunk them according to use cases.
Calculate embedding vectors: Calculate embedding vectors for each chunk.
Store and index: Store the chunks and their embedding vectors in the database and build an index for finding the nearest neighbor vectors.
Query and update: at query time, compute the embedding vector of the query, retrieve the nearest matches, and include them in the LLM prompt. Also plan for how the index will be kept up to date.
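The steps above can be sketched end to end in a few lines. The hash-based "embedding" here is a deterministic placeholder with no semantic meaning, standing in for a real embedding model, and the in-memory list stands in for a vector database:

```python
# Toy end-to-end sketch: chunk, embed, store, query.
# The hashing "embedding" is a stand-in for a real embedding endpoint;
# a real system would use a model and a proper vector index.
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Deterministic bag-of-words hashing: NOT semantically meaningful.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.sha256(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Collect and chunk documents (here: split on blank lines).
doc = "BigQuery supports vector search.\n\nPinecone is a vector database."
chunks = [c.strip() for c in doc.split("\n\n")]

# 2-3. Compute an embedding per chunk and store both together.
index = [(chunk, embed(chunk)) for chunk in chunks]

# 4. At query time, embed the query and take the nearest chunk.
query_vec = embed("vector search in BigQuery")
best = max(index, key=lambda item: cosine(query_vec, item[1]))
print(best[0])  # the chunk nearest to the query
```

The retrieved chunk would then be pasted into the LLM prompt as context.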
III. Initial Multi-System Implementation
A. Data Preprocessing
Preprocess data to improve information quality and protect privacy. For example, replace login credentials in tasks with variables and aggregate related information.
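A hedged sketch of one such preprocessing step, replacing login credentials with placeholder variables; the regex patterns and field names here are illustrative assumptions, not the author's actual rules:

```python
# Illustrative sketch of redacting credentials from recorded task text
# before embedding it. The patterns and placeholder names are assumptions.
import re

def redact_credentials(step: str) -> str:
    step = re.sub(r"(password\s*[:=]\s*)\S+", r"\1{{PASSWORD}}",
                  step, flags=re.IGNORECASE)
    step = re.sub(r"(username\s*[:=]\s*)\S+", r"\1{{USERNAME}}",
                  step, flags=re.IGNORECASE)
    return step

print(redact_credentials("Enter username: alice and password: hunter2"))
# Enter username: {{USERNAME}} and password: {{PASSWORD}}
```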
B. Calculate Text Embeddings
Use OpenAI's embeddings endpoint to compute an embedding vector for each text, and add an ID column generated from the SHA-256 hash of the prompt.
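The ID scheme can be sketched as below. The row layout is a hypothetical illustration; only the SHA-256-of-prompt idea comes from the text. Hashing the prompt makes the ID stable, so re-processing the same text produces the same row:

```python
# Sketch of the ID column described above: a SHA-256 hash of the prompt,
# so the same text always maps to the same ID. The row fields are
# illustrative, not the author's actual schema.
import hashlib

def prompt_id(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

row = {
    "id": prompt_id("Click the login button"),
    "text": "Click the login button",
    # "embedding": would come from the embedding endpoint
}
print(row["id"][:12])  # first characters of a stable 64-char hex digest
```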
C. Retrieve Related Test Steps
Upload embedding vectors: Create an index with Pinecone and upload embedding vectors and metadata.
Query matches: Based on user prompts, use Pinecone's index to find the three closest matches and map them to the operations performed by testers.
This implementation involves three systems (OpenAI, Pinecone, and BigQuery), which has drawbacks: working with multiple providers, transferring data across multiple locations, and added latency and cost.
IV. New BigQuery-Only Implementation
A. New Features of BigQuery
BigQuery now allows calculating text embeddings, creating vector indexes, and performing nearest neighbor vector searches, which can replace Pinecone and OpenAI's embedding API, simplifying the system.
B. Preprocessing
Define preprocessing logic in dbt using user-defined functions (UDFs), which can be called during building and querying.
C. Calculate Text Embeddings
Use dbt to create a new data pipeline model that reflects the previous table structure, but now entirely within BigQuery.
D. Retrieve Related Test Steps
VECTOR_SEARCH lets BigQuery quickly find nearest neighbors, and BigQuery manages the index automatically. Be aware of quota limits when querying.
V. Latency and Cost Comparison
A. Latency
Running the entire pipeline in BigQuery takes roughly the same time as the previous OpenAI→Pinecone→BigQuery system, about 4 seconds. That latency has little impact on the application, and caching could be added to either setup.
B. Cost
The two solutions cost roughly the same, about $0.01 per retrieval. Although the new pipeline offers no advantage in latency or cost, dealing with a single system simplifies development work, and the related datasets stay continuously up to date for free.
VI. Embedding or Keyword Search
A. Keyword Search
Keyword search is the classic method for retrieving relevant documents from a collection. It is simple and mature, but it has weaknesses such as not handling synonyms. Most data warehouses offer full-text search (for example, BigQuery's SEARCH function), though it may miss some relevant documents; BM25 ranking can improve the results.
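To make the BM25 idea concrete, here is a minimal scoring sketch using the common default parameters (k1 = 1.5, b = 0.75); the corpus and query are made up for illustration:

```python
# Minimal BM25 ranking sketch: term-frequency saturation plus document-
# length normalization on top of plain keyword matching. The corpus and
# query are illustrative.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (
                freq + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "reset your password from the login page",
    "the login page shows recent activity",
    "billing settings and invoices",
]
scores = bm25_scores("reset password", docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 0: matches both terms
```

Production systems would use a mature implementation rather than hand-rolled scoring, but the ranking behavior is the same in spirit.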
B. Hybrid Search
You can combine the results of keyword and embedding searches, or use a reranker such as Cohere Rerank to select the most relevant chunks.
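One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below; a model-based reranker such as Cohere's would replace or follow this step. The ranked ID lists are made up:

```python
# Sketch of reciprocal rank fusion (RRF): merge several ranked lists by
# summing 1 / (k + rank) per list. The chunk IDs are illustrative.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["chunk_a", "chunk_c", "chunk_b"]
embedding_hits = ["chunk_b", "chunk_a", "chunk_d"]
print(reciprocal_rank_fusion([keyword_hits, embedding_hits]))
# ['chunk_a', 'chunk_b', 'chunk_c', 'chunk_d']
```

Chunks ranked highly by both searches float to the top, even when neither search alone put them first.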
VII. Summary
A. Review and Optimization
If you already have a RAG pipeline, review how it was built and consider whether it can be simplified. Start with the simplest approach, and only then consider optimization.
B. Data Warehouse Selection
For some organizations, using an existing data warehouse may be the simplest solution, but it is not suitable for everyone. The author's team was pleased to be able to build and iterate on the RAG pipeline quickly; because it is tightly integrated with their regular data stack, there is little concern about index updates and maintenance. The article also notes how other data warehouses (Snowflake, AWS, Azure) can provide similar functionality.