Using OpenAI semantic search with Neo4J

Eranga Dulshan
4 min read · Jan 6, 2023


Photo by Evgeni Tcherkasski on Unsplash

Recently, OpenAI has become a hot topic. I had a database of PDF files that needed to be searchable. I was using Neo4J and its Lucene-based full-text search, and every Neo4J node contained the file URL and the extracted PDF text.

I wanted to try out the latest OpenAI semantic search on top of that setup. In this blog, I will walk through the path I followed, and I hope it will be helpful for anyone interested in similar technologies. I have used Python and the py2neo package, and I assume the reader is familiar with Python, Neo4J and py2neo (a Python driver library for Neo4J).

First, I will explain how OpenAI search works. OpenAI has pre-trained models that return something called embeddings when text is given as input. An embedding is simply a numeric vector with 1536 dimensions (for the model used here). If we calculate the cosine similarity of two embedding vectors, the resulting value tells us how related the two texts are. Cosine similarity scores vary between -1 and 1, and the higher the value, the stronger the relatedness. So if we already have the items that should be text-searchable, we get the embeddings for their text and save them in the database. When a search query comes in, we get the embedding for the search term, calculate the cosine similarity against the stored embeddings, and sort by the score in descending order to get the search results.
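
To make the scoring concrete, here is a minimal sketch of how the cosine similarity of two embedding vectors could be computed in plain Python. This is just for illustration; in the actual pipeline the calculation happens inside Neo4J, as shown later.

import math


def cosine_similarity(vec_a, vec_b):
    # Dot product of the two vectors (lists of 1536 floats from OpenAI)
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    # Magnitudes of each vector
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    # Score between -1 and 1; higher means more related
    return dot / (norm_a * norm_b)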

You can also check the semantic search example in the OpenAI examples list to get a further idea of how it works.

Let’s move on to the code.

First, you have to create an OpenAI account and get an API key. The moment you create an account, you get free API credit worth $18. You can access OpenAI here and create an API key here after logging in. Then, with the API key, we can request the embedding vectors for the text already in the database and save them as an array (in my scenario, I saved them on the same Neo4J node).

OpenAI provides a few models that we can use to get embeddings. I have used the

text-embedding-ada-002

model, which is the one OpenAI recommends. Here is the Python code to request an embedding from OpenAI.

import logging
import time

import openai

logger = logging.getLogger(__name__)

openai.api_key = OPEN_AI_KEY  # your OpenAI API key


def get_embeddings_from_open_ai(text):
    cur_time = time.time()
    logger.info("INFO::get_embeddings_from_open_ai::starting::")
    # Request the embedding vector for the given text
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    logger.info(f"INFO::get_embeddings_from_open_ai::ended::{ time.time() - cur_time }")
    # We sent a single input, so the embedding we need is the first entry
    embeddings = response['data'][0]['embedding']
    return embeddings
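
To tie this back to the earlier step of embedding the existing database, a backfill over the stored pdf nodes could look roughly like the sketch below. The text property name and the IS NULL filter are assumptions from my own setup; adjust them to your schema. The graph object is a py2neo Graph connection like the one shown in the next snippet.

def backfill_embeddings(graph):
    # Fetch pdf nodes that do not have an embedding yet
    rows = graph.run(
        "MATCH (a:pdf) WHERE a.embeddings IS NULL "
        "RETURN id(a) AS node_id, a.text AS text"
    ).data()
    for row in rows:
        # Request the embedding for the stored pdf text
        embedding = get_embeddings_from_open_ai(row['text'])
        # Save the vector back on the same node as an array property
        graph.run(
            "MATCH (a:pdf) WHERE id(a) = $node_id SET a.embeddings = $embedding",
            node_id=row['node_id'],
            embedding=embedding,
        )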

When a search query comes in, we first request its embedding from OpenAI and then calculate the cosine similarity against every embedding vector we have already stored. This has to happen for every search query. One option is to load every embedding array and do the calculation in the Python application layer, but with Neo4J we can calculate the cosine similarity at the database level. For that, you have to use the Neo4J Graph Data Science (GDS) package. Installation instructions can be found here, and a summary of the process is shown in the screenshot below (this is for the Neo4J server; if you are on Neo4J Desktop, you can find the steps at the same link).

Enabling data science package in Neo4J

You can find further examples of cosine similarity calculation in Neo4J here as well.
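
As a quick sanity check that the GDS function is available, you can call it on two small toy vectors through py2neo. This is just a sketch; the connection details are placeholders for your own setup.

from py2neo import Graph

# Placeholder connection details; use your own uri and credentials
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Identical vectors should give a similarity of 1.0 if GDS is installed correctly
sim = graph.run(
    "RETURN gds.alpha.similarity.cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) AS sim"
).evaluate()
print(sim)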

Now you can run the query to fetch the results based on cosine similarity. Here is the query that I used.

from py2neo import Graph

neo4j_instance = None  # cached connection, created on first use


def get_neo4j():
    global neo4j_instance, uri, secrets
    if neo4j_instance is None:
        neo4j_instance = Graph(uri, auth=(secrets['neo4j_username'], secrets['neo4j_password']))
    return neo4j_instance


def search(search_query):
    graph = get_neo4j()
    # Get the embedding for the search term itself
    search_query_embedding = get_embeddings_from_open_ai(search_query)
    # Compare it against every stored embedding with GDS cosine similarity
    # and return the 20 best matches
    query = """match(a:pdf) with a, gds.alpha.similarity.cosine(a.embeddings, %s) as similarity return a as pdf
    order by similarity desc limit %s"""
    query = query % (search_query_embedding, 20)
    pdfs = graph.run(query).data()
    return pdfs

Here we simply get the embedding for the search query first, and then we use the

gds.alpha.similarity.cosine

function from GDS to calculate the cosine similarity and sort based on the score.
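
As a usage example, a call could look like the sketch below. The url property name is an assumption based on my node structure; adjust it to your own schema.

results = search("machine learning applications in healthcare")
for row in results:
    pdf_node = row['pdf']
    # Each result is a py2neo Node with the stored pdf properties
    print(pdf_node['url'])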

So when you insert more items into the database (in this case, PDFs), you simply request the embedding for that item’s text, store the embedding vector as an array along with it, and the item becomes searchable.
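
For instance, an insert helper could look roughly like this sketch (the property names are from my setup, and the node label matches the search query above):

from py2neo import Node


def insert_pdf(graph, url, text):
    # Request the embedding once at insert time and store it on the node,
    # so the new pdf is immediately searchable
    embedding = get_embeddings_from_open_ai(text)
    node = Node("pdf", url=url, text=text, embeddings=embedding)
    graph.create(node)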

Please also note that there are limits on OpenAI API requests: input is measured in tokens, and for this embedding model a single request must not exceed 8191 tokens. You can get an idea of how tokens work in this article.
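
If you want to guard against that limit before sending a request, one option (not used elsewhere in this post, so treat it as an optional sketch) is OpenAI’s tiktoken package:

import tiktoken


def count_tokens(text, model="text-embedding-ada-002"):
    # tiktoken resolves the tokenizer used by the given model
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

Any text that comes in above the limit would have to be truncated or split into chunks before requesting embeddings.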

That’s it and happy coding 😃.
