In today's information-rich world, finding relevant documents quickly is essential. Traditional keyword-based search systems often fall short when dealing with semantic meaning. This tutorial demonstrates how to build a powerful document search engine using:
Hugging Face's embedding models to convert text into rich vector representations
Chroma DB as our vector database for efficient similarity search
Sentence transformers for high-quality text embeddings
This implementation enables semantic search capabilities: finding documents based on meaning rather than just keyword matching. By the end of this tutorial, you'll have a working document search engine that can:
Process and embed text documents
Store these embeddings efficiently
Retrieve the most semantically similar documents to any query
Handle a variety of document types and search needs
Please follow the detailed steps below in sequence to implement DocSearchAgent.
First, we need to install the necessary libraries.
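The package list below is an assumption based on the imports used throughout the tutorial; versions are left unpinned, so adjust as needed for your environment.
!pip install datasets chromadb sentence-transformers langchain pandas numpy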
Let's start by importing the libraries we'll use:
import numpy as np
import pandas as pd
from datasets import load_dataset
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import time
For this tutorial, we'll use a subset of Wikipedia articles from the Hugging Face datasets library. This gives us a diverse set of documents to work with.
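The specific dataset configuration and split below are assumptions for illustration; any small Wikipedia subset that exposes title, text, and url fields will work the same way.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:1000]")  # assumed dataset config and split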
print(f"Loaded {len(dataset)} Wikipedia articles")
documents = []
for i, article in enumerate(dataset):
    doc = {
        "id": f"doc_{i}",
        "title": article["title"],
        "text": article["text"],
        "url": article["url"]
    }
    documents.append(doc)
df = pd.DataFrame(documents)
df.head(3)
Now, let's split our documents into smaller chunks for more granular searching:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = []
chunk_ids = []
chunk_sources = []
for i, doc in enumerate(documents):
    doc_chunks = text_splitter.split_text(doc["text"])
    chunks.extend(doc_chunks)
    chunk_ids.extend([f"chunk_{i}_{j}" for j in range(len(doc_chunks))])
    chunk_sources.extend([doc["title"]] * len(doc_chunks))
print(f"Created {len(chunks)} chunks from {len(documents)} documents")
We'll use a pre-trained sentence transformer model from Hugging Face to create our embeddings:
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model name; any sentence-transformers model works here
embedding_model = SentenceTransformer(model_name)
sample_text = "This is a sample text to test our embedding model."
sample_embedding = embedding_model.encode(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")
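To see what "semantic" means in practice, here is a small, illustrative check (the sentences are made up for this sketch): a semantically related pair of sentences should receive a noticeably higher cosine similarity than an unrelated pair.
from sentence_transformers import util
emb_a = embedding_model.encode("The cat sat on the mat.")
emb_b = embedding_model.encode("A feline was resting on the rug.")
emb_c = embedding_model.encode("Stock markets fell sharply today.")
print(f"Related pair:   {util.cos_sim(emb_a, emb_b).item():.3f}")  # expected to be comparatively high
print(f"Unrelated pair: {util.cos_sim(emb_a, emb_c).item():.3f}")  # expected to be lower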
Now, let's set up Chroma DB, a lightweight vector database that is perfect for our search engine:
chroma_client = chromadb.Client()  # in-memory Chroma client
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
collection = chroma_client.create_collection(
    name="document_search",
    embedding_function=embedding_function
)
batch_size = 100
for i in range(0, len(chunks), batch_size):
    end_idx = min(i + batch_size, len(chunks))
    batch_ids = chunk_ids[i:end_idx]
    batch_chunks = chunks[i:end_idx]
    batch_sources = chunk_sources[i:end_idx]
    collection.add(
        ids=batch_ids,
        documents=batch_chunks,
        metadatas=[{"source": source} for source in batch_sources]
    )
    print(f"Added batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1} to the collection")
print(f"Total documents in collection: {collection.count()}")
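The in-memory client above loses the index when the process ends. If you want the embeddings to persist across sessions, recent Chroma versions also provide a persistent client; a minimal sketch, with an illustrative storage path:
persistent_client = chromadb.PersistentClient(path="./chroma_db")  # assumed path; stores the index on disk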
Now comes the exciting part: searching through our documents.
def search_documents(query, n_results=5):  # default of 5 results assumed
    """
    Search for documents similar to the query.
    Args:
        query (str): The search query
        n_results (int): Number of results to return
    Returns:
        dict: Search results
    """
    start_time = time.time()
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    end_time = time.time()
    search_time = end_time - start_time
    print(f"Search completed in {search_time:.4f} seconds")
    return results
queries = [
    "What are the effects of climate change?",
    "History of artificial intelligence",
    "Space exploration missions"
]
for query in queries:
    print(f"\nQuery: {query}")
    results = search_documents(query)
    for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        print(f"\nResult {i+1} from {metadata['source']}:")
        print(f"{doc[:200]}...")
Let's create a simple function to provide a better user experience:
def interactive_search():
    """
    Interactive search interface for the document search engine.
    """
    while True:
        query = input("\nEnter your search query (or 'quit' to exit): ")
        if query.lower() == 'quit':
            print("Exiting search interface...")
            break
        n_results = int(input("How many results would you like? "))
        results = search_documents(query, n_results)
        print(f"\nFound {len(results['documents'][0])} results for '{query}':")
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )):
            relevance = 1 - distance  # rough score: smaller distance means a closer match
            print(f"\n--- Result {i+1} ---")
            print(f"Source: {metadata['source']}")
            print(f"Relevance: {relevance:.2f}")
            print(f"Excerpt: {doc[:300]}...")
            print("-" * 50)

interactive_search()
Let's add the ability to filter our search results by metadata:
def filtered_search(query, filter_source=None, n_results=5):  # default values assumed
    """
    Search with optional filtering by source.
    Args:
        query (str): The search query
        filter_source (str): Optional source to filter by
        n_results (int): Number of results to return
    Returns:
        dict: Search results
    """
    where_clause = {"source": filter_source} if filter_source else None
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where_clause
    )
    return results
unique_sources = list(set(chunk_sources))
print(f"Available sources for filtering: {len(unique_sources)}")
print(unique_sources[:5])
if len(unique_sources) > 0:
    filter_source = unique_sources[0]
    query = "main concepts and principles"
    print(f"\nFiltered search for '{query}' in source '{filter_source}':")
    results = filtered_search(query, filter_source=filter_source)
    for i, doc in enumerate(results['documents'][0]):
        print(f"\nResult {i+1}:")
        print(f"{doc[:200]}...")
In conclusion, this tutorial demonstrated how to build a semantic document search engine using Hugging Face embedding models and ChromaDB. The system retrieves documents based on meaning rather than just keywords by transforming text into vector representations. The implementation processes Wikipedia articles, chunks them for granularity, embeds them with sentence transformers, and stores them in a vector database for efficient retrieval. The final product features interactive searching, metadata filtering, and relevance ranking.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.