Comparing text using embeddings#

In this tutorial, you’ll learn how to create a Python application that compares two input texts using semantic embeddings and Levenshtein similarity. The application communicates with the API server to generate embeddings and calculates similarity metrics programmatically.

Embeddings are numerical representations of the meaning of a string of text. Strings with similar meanings produce numerically similar embeddings, and the closer two embeddings are, the more closely related the texts, even if the words themselves differ. For example, the words “vision” and “sight” might have embeddings that are numerically close due to their similar meanings.
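As a toy illustration of "closeness," embeddings are usually compared with cosine similarity, which measures how closely two vectors point in the same direction. The three-dimensional vectors below are invented for illustration only; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vision = [0.9, 0.1, 0.3]   # hypothetical embedding for "vision"
sight  = [0.8, 0.2, 0.3]   # hypothetical embedding for "sight"
banana = [0.1, 0.9, 0.2]   # hypothetical embedding for "banana"

print(cosine(vision, sight))   # close to 1: related meanings
print(cosine(vision, banana))  # noticeably lower: unrelated meanings
```

A value near 1 means the vectors point in nearly the same direction (related meanings); values near 0 indicate unrelated meanings.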

Prerequisites
  • Before you begin, ensure that you have the conda package manager installed on your machine. You can install conda using either Anaconda Distribution or Miniconda.

  • You must have a sentence-similarity type model downloaded onto your local machine.

Setting up your environment#

When working on a new conda project, it is recommended that you create a new environment for development. Follow these steps to set up an environment for your embedding application:

  1. Open Anaconda Prompt (Terminal on macOS/Linux).

    Tip

    This terminal can be opened from within an IDE (JupyterLab, PyCharm, VSCode, Spyder), if preferred.

  2. Create the conda environment to develop your embedding application and install the packages you’ll need by running the following command:

    conda create -n content-compare python numpy scikit-learn python-Levenshtein requests
    
  3. Activate your newly created conda environment by running the following command:

    conda activate content-compare
    

For more information and best practices for managing environments, see Environments.

Building the text comparator#

Below, you’ll find the necessary code snippets to build your text comparator, with explanations for each step to help you understand how the application works. The text comparator combines two methods for comparing text: semantic similarity using embeddings, and structural similarity using Levenshtein distance.

Semantic similarity tells us how close the meanings of two texts are, while Levenshtein distance looks at how similar the actual characters are by counting the edits needed to turn one string into the other. Together, these methods help us understand how similar two text strings are—whether they look alike, mean the same thing, or both.
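To make the edit-counting idea concrete, here is a minimal pure-Python sketch of Levenshtein distance. (The python-Levenshtein package you install later computes the same thing far more efficiently in C; this version exists only to show the logic.)

```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance:
    # dp[j] holds the cost of converting a prefix of s into t[:j]
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,         # deletion
                dp[j - 1] + 1,     # insertion
                prev + (cs != ct)  # substitution (free if chars match)
            )
    return dp[-1]

print(levenshtein("kitten", "sitting"))  # 3 edits: k->s, e->i, +g
```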

Using your preferred IDE, create a new file and name it similarian.py.

Importing libraries#

The application we are building requires libraries to handle HTTP requests, numerical operations, and string similarity calculations.

Add the following lines of code to the top of your similarian.py file:

import requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import Levenshtein

Setting the base_url#

For the application to run server health checks, generate embeddings, and perform other actions programmatically, it must be structured to interact with the API server and its endpoints.

The URL for each API endpoint is constructed by combining a base_url with the specific /endpoint for that function. The base_url is formed from the Server Address and Server Port specified in Anaconda AI Navigator, like this: http://<SERVER_ADDRESS>:<SERVER_PORT>.

Set the base_url to point to the default server address by adding the following line to your file.

base_url = 'http://localhost:8080'

Tip

localhost resolves to the loopback address 127.0.0.1, so the two are interchangeable here.

Adding the API calls#

AI Navigator utilizes llama.cpp’s specifications for interacting with the API server’s /embedding endpoint.

Tip

The API server is also compatible with OpenAI’s /embeddings API specifications.

To enable your application to communicate with the API server, you must implement functions that make API calls in a way that the server can understand.

GET /health#

Before sending any requests to the server, it’s a good idea to verify that the server is operational. This function sends a GET request to the /health endpoint and returns a JSON response that tells you the server’s status.

Add the following lines to your similarian.py file:

def get_server_health():
    response = requests.get(f'{base_url}/health')
    return response.json()

POST /embedding#

To interact with a sentence-similarity model, you must have a function that hits the server’s /embedding endpoint. This function processes input text and returns its vector representation (embedding).

Add the following lines to your similarian.py file:

def get_embedding(input_text):
    data = {"content": input_text}
    headers = {"Content-Type": "application/json"}
    response = requests.post(f"{base_url}/embedding", json=data, headers=headers)

    if response.status_code == 200:
        return response.json()["embedding"]
    else:
        raise Exception(f"Error: {response.status_code}, {response.text}")

Constructing the functions#

Now that we have added the API calls to communicate with the API server, we’ll need to construct the core functionality of our application: comparing two strings of text. This involves measuring their semantic (meaning-based) and structural (character-based) similarities.

compare_texts#

This function takes the two text inputs from the main function and calculates the semantic and structural similarity scores.

Add the following lines to your similarian.py file:

def compare_texts(text1, text2):
    # Get embeddings and calculate semantic similarity
    emb1 = get_embedding(text1)
    emb2 = get_embedding(text2)
    semantic_sim = cosine_similarity(
        np.array(emb1).reshape(1, -1),
        np.array(emb2).reshape(1, -1)
    )[0][0]

    # Calculate Levenshtein similarity
    distance = Levenshtein.distance(text1.lower(), text2.lower())
    max_length = max(len(text1), len(text2))
    # Guard against dividing by zero when both strings are empty
    levenshtein_sim = 1 - (distance / max_length) if max_length > 0 else 1.0

    return semantic_sim, levenshtein_sim
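As a quick sanity check of the normalization used above, here is the same formula applied by hand to a short pair of strings:

```python
# The normalization maps edit distance into the range [0, 1]:
#   similarity = 1 - distance / max(len(a), len(b))
distance = 1                               # "cat" -> "cats" needs one insertion
max_length = max(len("cat"), len("cats"))  # 4
similarity = 1 - distance / max_length
print(similarity)  # 0.75: one edit spread across four characters
```

A similarity of 1.0 means the strings are identical; 0.0 means every character would need to change.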

main#

The main function ties the rest of the functions together and handles user input. It takes two inputs from the user and displays the results from the similarity calculations.

Add the following lines to your similarian.py file:

def main():
    print("Enter two sentences to compare:")
    text1 = input("Sentence 1: ")
    text2 = input("Sentence 2: ")

    print("\nCalculating similarities...")

    try:
        semantic_sim, levenshtein_sim = compare_texts(text1, text2)
        print("\nResults:")
        print(f"Semantic (embedding) similarity: {semantic_sim:.2%}")
        print(f"Levenshtein (string) similarity: {levenshtein_sim:.2%}")
    except requests.exceptions.ConnectionError:
        print("\nError: Could not connect to embedding server at localhost:8080")
        print("Make sure your local embedding server is running")

if __name__ == "__main__":
    main()

Interacting with the API server#

With your text comparator constructed, it’s time to compare some text!

  1. Open Anaconda AI Navigator and load a model into the API server.

    Note

    This must be a sentence-similarity type model!

  2. Leave the Server Address and Server Port at the default values and click Start.

  3. Open a terminal and navigate to the directory where you stored your similarian.py file.

    Tip

    Make sure you are still in your content-compare conda environment.

  4. Initiate the text comparator by running the following command:

    python similarian.py
    

    Note

    You’ll need to run this command every time you want to run the text comparator.

  5. Enter a string of text and press Enter (return on macOS).

  6. Enter a string of text that you want to compare to the previous string and press Enter again.

  7. View the Anaconda AI Navigator API server logs. If everything is set up correctly, the server logs will populate with traffic from the application, starting with a health check.

Here is an example of interacting with the text comparator, assuming you’ve previously navigated to the directory containing your similarian.py file:

(content-compare) ➜ python similarian.py
Sentence 1: That book was amazing! I did not see that twist coming!
Sentence 2: That tome was naught but wondrous! The twist therein didst elude mine keenest foresight.

Calculating similarities...

Results:
Semantic (embedding) similarity: 72.61%
Levenshtein (string) similarity: 31.82%

Comparing sentences#

Here are some examples you can use to get a feel for how text comparisons work:

Synonyms and rephrasing

Try writing the same phrase in two different ways to see how the semantic meaning similarity remains high, even when the structural similarity differs greatly.

The quick brown fox jumped over the lazy dog.
The swift auburn fox leaped above the sluggish canine.

Effects of typos

Experiment with minor typos to observe how semantic similarity remains consistent, while structural similarity drops due to increased edit distance.

The quick brown fox jumps over the lazy dog.
The quikc brn fox jumpd ovr the lazy dog.

Opposite meanings

Compare sentences that are structurally similar but have opposite meanings. This highlights how semantic similarity can drop even when Levenshtein similarity remains high.

I love dogs; they are the best.
I hate dogs; they are the worst.

Next steps#

You can continue to develop and extend this text comparator to tackle more advanced use cases, such as implementing a database to store embeddings for efficient comparisons at scale, allowing you to build tools like duplicate content detectors, recommendation systems, or document clustering applications.
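As a starting point for the database idea, here is a minimal sketch that caches embeddings in SQLite so each text only needs to be embedded once. The schema and helper names are invented for illustration, and the stub get_embedding stands in for the API-backed function defined earlier in this tutorial:

```python
import json
import sqlite3

def get_embedding(text):
    # Stand-in for the API call defined earlier in this tutorial;
    # it returns a fake two-number vector so the sketch runs offline.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

conn = sqlite3.connect(":memory:")  # use a file path instead to persist
conn.execute(
    "CREATE TABLE IF NOT EXISTS embeddings (text TEXT PRIMARY KEY, vector TEXT)"
)

def cached_embedding(text):
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE text = ?", (text,)
    ).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: skip the embedding call
    vector = get_embedding(text)
    conn.execute(
        "INSERT INTO embeddings (text, vector) VALUES (?, ?)",
        (text, json.dumps(vector)),
    )
    conn.commit()
    return vector

first = cached_embedding("vision")   # computed and stored
second = cached_embedding("vision")  # served from the cache
```

For comparisons at scale, you would pair a store like this with a vector index rather than comparing every row pairwise.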

Or, if you’re finished with this project, you can delete the file and clean up your conda environment by running the following commands:

conda deactivate
conda remove -n content-compare --all