LanceDB handles multimodal data (images, audio, video, and PDF files) natively by storing the raw bytes in a binary column alongside your vectors and metadata. Keeping raw assets and their embeddings in the same database simplifies your data infrastructure and, for many use cases, removes the need for separate object storage. This guide shows how to ingest, store, and retrieve image data using standard binary columns, and introduces the Lance Blob API for optimized handling of larger multimodal files.

Storing Binary Data

To store binary data, use the pa.binary() data type in your Arrow schema. If you define the schema with LanceDB’s Pydantic LanceModel instead, the equivalent Python field type is bytes.

1. Setup and Imports

First, let’s import the necessary libraries. We’ll use PIL (Pillow) for image handling and io for byte conversion.

2. Preparing Data

For this example, we’ll create some dummy in-memory images. In a real application, you would read these from files or an API. The key is to convert your data (image, audio, etc.) into a raw bytes object.
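One way to build such records is sketched below. The helper image_to_bytes, the field names, and the toy 2-dimensional vectors are assumptions made for illustration; in practice the vectors would come from an embedding model.

```python
import io

from PIL import Image


def image_to_bytes(img, fmt="PNG"):
    """Encode a PIL image into raw bytes in the given format."""
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    return buf.getvalue()


# Three solid-color dummy images, each paired with a placeholder vector.
colors = ["red", "green", "blue"]
data = [
    {
        "id": i,
        "vector": [float(i), float(i) + 0.5],   # stand-in for a real embedding
        "image_bytes": image_to_bytes(Image.new("RGB", (64, 64), color=c)),
    }
    for i, c in enumerate(colors)
]

print(len(data), type(data[0]["image_bytes"]))
```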

3. Defining the Schema

When creating the table, it is highly recommended to define the schema explicitly. This ensures that your binary data is correctly interpreted as a binary type by Arrow/LanceDB and not as a generic string or list.

4. Ingesting Data

Now, create the table using the data and the defined schema.

Retrieving and Using Blobs

When you search your LanceDB table, you can retrieve the binary column just like any other metadata.

Converting Bytes Back to Objects

Once you have the bytes data back from the search result, you can decode it back into its original format (e.g., a PIL Image, an Audio buffer, etc.).
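For images, decoding amounts to wrapping the bytes in a buffer and opening it with PIL. In this sketch the variable image_bytes is a stand-in for the value read from a search result, produced here by encoding a dummy image.

```python
import io

from PIL import Image

# Stand-in for the bytes returned from a query: encode a dummy red image.
buf = io.BytesIO()
Image.new("RGB", (64, 64), color="red").save(buf, format="PNG")
image_bytes = buf.getvalue()

# Decode: wrap the raw bytes in a file-like object and open it.
decoded = Image.open(io.BytesIO(image_bytes))
decoded.load()  # force the pixel data to be read now

print(decoded.size, decoded.mode)
```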

Large Blobs (Blob API)

For larger files like high-resolution images or videos, Lance provides a specialized Blob API. By using pa.large_binary() and specific metadata, you enable lazy loading and optimized encoding. This allows you to work with massive datasets without loading all binary data into memory upfront.

1. Defining a Blob Schema

To use the Blob API, you must mark the column with {"lance-encoding:blob": "true"} metadata.

2. Ingesting Large Blobs

You can then ingest data normally, and Lance will handle the optimized storage. For more advanced usage, including random access and file-like reading of blobs, see the Lance Blob API documentation.

Other Modalities

The pa.binary() and pa.large_binary() types are universal. You can use this same pattern for other types of multimodal data:
  • Audio: Read .wav or .mp3 files as bytes.
  • Video: Store short segments or full clips using the Blob API.
  • PDFs/Documents: Store the raw file content for document search.
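As an illustration for audio, the sketch below builds a short WAV file in memory with the standard library and reads it back from the raw bytes. The helper make_wav_bytes and its parameters are hypothetical; with real files you would simply read the file contents in binary mode and store them in the binary column.

```python
import io
import math
import struct
import wave


def make_wav_bytes(n_samples=800, sample_rate=8000, freq_hz=440.0):
    """Generate a mono 16-bit sine tone and return it as WAV-encoded bytes."""
    frames = b"".join(
        struct.pack(
            "<h",
            int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate)),
        )
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)         # mono
        w.setsampwidth(2)         # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(frames)
    return buf.getvalue()


# These bytes are what would go into a pa.binary() column.
audio_bytes = make_wav_bytes()

# Decoding after retrieval: wrap the bytes and open them as a WAV stream.
with wave.open(io.BytesIO(audio_bytes), "rb") as r:
    n_frames = r.getnframes()
    rate = r.getframerate()

print(len(audio_bytes), n_frames, rate)
```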