Storing Binary Data
To store binary data, use the pa.binary() data type in your Arrow schema. In Python, this corresponds to bytes objects if you’re using LanceDB’s Pydantic LanceModel to define the schema.
1. Setup and Imports
First, let’s import the necessary libraries. We’ll use PIL (Pillow) for image handling and io for byte conversion.
2. Preparing Data
For this example, we’ll create some dummy in-memory images. In a real application, you would read these from files or an API. The key is to convert your data (image, audio, etc.) into a raw bytes object.
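One way to prepare such data is sketched below: it builds a few solid-color images in memory with Pillow and serializes each one to PNG bytes. The helper image_to_bytes and the row keys (id, vector, image_blob) are assumptions chosen for illustration.

```python
import io

from PIL import Image


def image_to_bytes(img: Image.Image, fmt: str = "PNG") -> bytes:
    """Serialize a PIL image into raw bytes using an in-memory buffer."""
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    return buf.getvalue()


# Dummy images standing in for files you would normally read from disk or an API.
colors = ["red", "green", "blue"]
data = [
    {
        "id": i,
        "vector": [float(i), float(i)],  # placeholder embedding
        "image_blob": image_to_bytes(Image.new("RGB", (64, 64), color=color)),
    }
    for i, color in enumerate(colors)
]
```

Each image_blob value is now a plain bytes object, which is the form a binary column expects.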
3. Defining the Schema
When creating the table, it is highly recommended to define the schema explicitly. This ensures that your binary data is correctly interpreted as a binary type by Arrow/LanceDB and not as a generic string or list.
4. Ingesting Data
Now, create the table using the data and the defined schema.
Retrieving and Using Blobs
When you search your LanceDB table, you can retrieve the binary column just like any other metadata.
Converting Bytes Back to Objects
Once you have the bytes data back from the search result, you can decode it back into its original format (e.g., a PIL Image, an audio buffer, etc.).
Large Blobs (Blob API)
For larger files like high-resolution images or videos, Lance provides a specialized Blob API. By using pa.large_binary() and specific metadata, you enable lazy loading and optimized encoding. This allows you to work with massive datasets without loading all binary data into memory upfront.
1. Defining a Blob Schema
To use the Blob API, you must mark the column with {"lance-encoding:blob": "true"} metadata.
2. Ingesting Large Blobs
You can then ingest data normally, and Lance will handle the optimized storage. For more advanced usage, including random access and file-like reading of blobs, see the Lance Blob API documentation.
Other Modalities
The pa.binary() and pa.large_binary() types are universal. You can use this same pattern for other types of multimodal data:
- Audio: Read .wav or .mp3 files as bytes.
- Video: Store video transitions or full clips using the Blob API.
- PDFs/Documents: Store the raw file content for document search.
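As an illustration of the same pattern for audio, the sketch below uses only the Python standard library to produce WAV bytes in memory and decode them again. The helper make_wav_bytes and the silent dummy signal are assumptions; in practice you would read the bytes from an existing file with open(path, "rb").read().

```python
import io
import wave


def make_wav_bytes(num_frames: int = 800, sample_rate: int = 8000) -> bytes:
    """Build a short, silent mono 16-bit WAV file entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


# These bytes can go into a pa.binary() or pa.large_binary() column as-is.
audio_blob = make_wav_bytes()
row = {"id": 0, "audio": audio_blob}

# Decoding after retrieval mirrors the image case: wrap the bytes in a buffer.
with wave.open(io.BytesIO(row["audio"]), "rb") as wav:
    frame_count = wav.getnframes()
    rate = wav.getframerate()
```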