Metadata Indexing in Realtime with OpenIO and Elasticsearch

With OpenIO, when a new file is uploaded, you can index all its metadata in Elasticsearch. You will then be able to query Elasticsearch for all the files matching any type of metadata. Here is a step-by-step guide.
Guillaume Delaporte
VP Sales at OpenIO

It’s time for the second article in our series about Grid for Apps, the event-driven framework that is part of our object storage solution. If you missed our first article, you can read "A technical introduction to Grid for Apps".

The first article discussed enriching object metadata by adding a new metadata tag to files on the fly. We can think of many useful types of metadata that we can add to an object. For example, it could be a license plate extracted from a speed control camera picture, a document type, the resolution and bit rate of a video, or a pattern found in a picture. But let’s answer a question first: since a software-defined object storage solution can store petabytes of data and billions of files, how can you make use of it? How can you find the specific objects you are looking for?
This is actually easy with OpenIO: when a new file is uploaded, you can index all its metadata in Elasticsearch. You will then be able to query Elasticsearch for all the files matching any type of metadata.

Let's do it!

As in our previous article, we will use our Docker container image to easily spawn an OpenIO SDS environment. We will also use the Elasticsearch Docker image to deploy it.

Retrieve the latest Elasticsearch Docker image (5.4.0 as of this writing):

# docker pull

And start an Elasticsearch instance:

# docker run -d -p 9200:9200 -e "" -e ""

Retrieve the OpenIO SDS Docker image:

# docker pull openio/sds

Start your new OpenIO SDS environment:

# docker run -ti --tty openio/sds

You should now be at the prompt with an OpenIO SDS instance up and running.

Next, we will configure the trigger, so that each time you add a new object, the metadata from the object will be pushed to Elasticsearch. Add the following content to the file /etc/oio/sds/OPENIO/oio-event-agent-0/oio-event-handlers.conf:

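[handler:storage.content.new]
pipeline = process

[handler:storage.content.deleted]
pipeline = content_cleaner

[handler:storage.container.new]
pipeline = account_update

[handler:storage.container.deleted]
pipeline = account_update

[handler:storage.container.state]
pipeline = account_update

[handler:storage.chunk.new]
pipeline = volume_index

[handler:storage.chunk.deleted]
pipeline = volume_index

[filter:content_cleaner]
use = egg:oio#content_cleaner

[filter:account_update]
use = egg:oio#account_update

[filter:volume_index]
use = egg:oio#volume_index

[filter:process]
use = egg:oio#notify
tube = oio-process
queue_url = beanstalk://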

If you want to learn more about this configuration file, please refer to our previous blog post.
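
To make the relationship between handlers and filters more concrete, here is a minimal sketch that parses a configuration of this shape with Python's standard configparser module and resolves the filters that an event passes through. The section names (handler:storage.content.new, filter:process) and the HOST:PORT placeholder are assumptions made for illustration, and this is not how the event agent itself loads its configuration.

```python
import configparser

# Illustrative excerpt; the section names and HOST:PORT are assumptions.
SAMPLE_CONF = """
[handler:storage.content.new]
pipeline = process

[filter:process]
use = egg:oio#notify
tube = oio-process
queue_url = beanstalk://HOST:PORT
"""


def resolve_pipeline(conf_text, event_type):
    """Return the settings of each filter in the pipeline for event_type."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # A pipeline is a whitespace-separated list of filter names.
    names = parser["handler:" + event_type]["pipeline"].split()
    return [dict(parser["filter:" + name]) for name in names]


filters = resolve_pipeline(SAMPLE_CONF, "storage.content.new")
print(filters[0]["tube"])
```

In this sketch, a "new content" event resolves to a single notify filter, which is the step that pushes the event into the oio-process tube for our script to pick up.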

Then, restart the OpenIO event agent to apply the change:

# gridinit_cmd restart @oio-event-agent

Your event-driven system is now up and running. The next step is to write the script that will index the metadata into Elasticsearch. To do so, we first need to install the Elasticsearch Python module:

# yum install python-elasticsearch.noarch

And we can now write the script. Let’s call it

#!/usr/bin/env python
import json
from oio.api import object_storage
from oio.event.beanstalk import Beanstalk, ResponseError
from elasticsearch import Elasticsearch

# Initiate a connection to beanstalk to fetch the events from the tube oio-process
b = Beanstalk.from_url("beanstalk://")
b.watch("oio-process")

# Waiting for events
while True:
    try:
        # Reserve the event when it appears
        event_id, data = b.reserve()
    except ResponseError:
        # Or continue waiting for the next one
        continue
    print(event_id)

    # Retrieve the information from the event (namespace, bucket, object name ...)
    meta = json.loads(data)
    print(meta)
    url = meta["url"]

    # Initiate a connection with the OpenIO cluster
    s = object_storage.ObjectStorageAPI(url["ns"], "")

    # Create index for metadata
    print("indexing")
    # Open a connection to Elasticsearch
    # /!\ Change the IP /!\
    es = Elasticsearch(['http://elastic:changeme@'])

    # Retrieve the metadata from the object (the data stream is not used)
    meta, stream = s.object_fetch(url["account"], url["user"], url["path"])

    # Create the index in Elasticsearch if it does not exist
    if not es.indices.exists(index=url["account"]):
        es.indices.create(index=url["account"])

    # Push the metadata to Elasticsearch: one index per account,
    # one document type per container
    res = es.index(index=url["account"], doc_type=url["user"], body=meta)

    # Delete the event so that it is not processed again
    b.delete(event_id)

You will have to modify the IP address of the Elasticsearch instance. In my case, the IP address of my machine was Change it according to your environment.

Finally, launch the script in the background:

# python

Please note that the script is written in Python, but you can write it in any other language.
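
To show the core of the script in isolation, here is a minimal sketch that reduces it to one function: decode the event, fetch the object's metadata, and index it. The function name process_event and the two callables standing in for the OpenIO and Elasticsearch clients are hypothetical, introduced only so that the logic can be exercised without a running cluster.

```python
import json


def process_event(data, fetch_metadata, index_document):
    """Index the metadata of the object that one raw event describes."""
    url = json.loads(data)["url"]
    # In the event, "user" holds the container name.
    account, container, path = url["account"], url["user"], url["path"]
    # Fetch the object's metadata, then push it to the search index:
    # one index per account, one document type per container.
    meta = fetch_metadata(account, container, path)
    index_document(account, container, meta)
    return account, container, path


# Hypothetical stand-ins for the storage and Elasticsearch clients.
indexed = []
event = json.dumps({"url": {"ns": "OPENIO", "account": "myaccount",
                            "user": "mycontainer", "path": "fstab"}})
process_event(
    event,
    fetch_metadata=lambda a, c, p: {"name": p, "properties": {"type": "configfile"}},
    index_document=lambda i, t, body: indexed.append((i, t, body)),
)
print(indexed)
```

Keeping the clients behind plain callables is only a design choice for this sketch; the script above talks to beanstalk, OpenIO, and Elasticsearch directly.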

How does it work?

It’s time to add a new object to see if this works.

Using the OpenIO CLI, let’s upload the new object /etc/fstab to the container mycontainer in the account myaccount. We will also add the metadata type=configfile that will help to search for it in Elasticsearch.

# openio --oio-ns OPENIO --oio-account myaccount object create mycontainer /etc/fstab --property type=configfile

Well done! You’ve just uploaded the file fstab, while, in the background, its metadata was indexed in Elasticsearch.

Now, we’ll query Elasticsearch, asking it to find all the objects that have a property matching configfile:

# curl -XPOST 'http://elastic:changeme@' -d '
{
  "query": {
    "multi_match" : {
      "query":    "configfile",
      "fields": [ "properties.*" ]
    }
  }
}'

Searching the object properties ("fields": [ "properties.*" ]) for the value configfile ("query": "configfile"), we obtain the following result:

{
  "took" : 147,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "myaccount",
        "_type" : "mycontainer",
        "_id" : "AVv0rLyke_i5VY9AuEZo",
        "_score" : 0.2876821,
        "_source" : {
          "hash" : "FB2B5EC6E6BC56CF7D02BE2B3D4AA5BA",
          "ctime" : "1494458539",
          "deleted" : "False",
          "container_id" : "594C8B26EA13E562391013AE6FC360C2C1691F314164DD457EF583B16712E360",
          "properties" : {
            "type" : "configfile"
          },
          "length" : "313",
          "hash_method" : "md5",
          "chunk_method" : "plain/nb_copy=1",
          "version" : "1494458538343358",
          "policy" : "SINGLE",
          "ns" : "OPENIO",
          "id" : "B43B4FBE334F05006C1396B21200CE3B",
          "mime_type" : "application/octet-stream",
          "name" : "fstab"
        }
      }
    ]
  }
}

All right, our newly uploaded file was detected by Elasticsearch as matching the request "query": "configfile".
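
If you would rather work from Python than from curl, the query and the response handling can be sketched as follows. The helper names build_property_query and matching_objects are hypothetical, and the response is a trimmed copy of the result shown above, so the example runs without an Elasticsearch instance.

```python
import json


def build_property_query(value, fields=("properties.*",)):
    """Build a multi_match request body that searches `value` in `fields`."""
    return {"query": {"multi_match": {"query": value, "fields": list(fields)}}}


def matching_objects(response):
    """Return (account, container, object name) for each hit."""
    return [
        (hit["_index"], hit["_type"], hit["_source"]["name"])
        for hit in response["hits"]["hits"]
    ]


# Request body equivalent to the one sent with curl.
print(json.dumps(build_property_query("configfile")))

# Trimmed copy of the response shown above.
response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "myaccount",
                "_type": "mycontainer",
                "_source": {"name": "fstab", "properties": {"type": "configfile"}},
            }
        ],
    }
}
print(matching_objects(response))
```

Because the script indexes each object under its account (index) and container (document type), every hit carries enough information to locate the object in OpenIO.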

Want to know more about OpenIO?

OpenIO SDS is available for testing in several flavors, including Linux packages, the Docker image, and Raspberry Pi.

Stay in touch with us and our community through Twitter and our Slack community channel to receive the latest info and support, and to chat with other users.

Guillaume Delaporte
Guillaume has extensive experience in building and running large storage platforms, which he gained as system engineer and project leader at Atos Worldline, before co-founding OpenIO in 2015.