May 15th, 2011

Search Your Gmail Messages with ElasticSearch and Ruby

If you’d like to try out ElasticSearch, there are already plenty of ways to get data into it. You can use the Twitter or Wikipedia river to fill it with gigabytes of public data, or you can feed it very quickly with a few RSS feeds.

But, let’s get a bit personal, shall we? Let’s feed it with your own e-mail, imported from your own Gmail account.

We’ll use a couple of Ruby gems: Gmail to fetch the e-mail data, Tire to put it into ElasticSearch and search it, and Sinatra to create a simple web application that lets us search the messages. You can see it displayed below.

Your Gmail in ElasticSearch

First of all, download or clone the source code from this gist. If you already have ElasticSearch, Ruby and Rubygems installed, install all the required gems with Bundler:

$ bundle install

We’ll import the data with the gmail-import.rb script. You must pass it your Gmail credentials, like this:

$ ruby gmail-import.rb user@gmail.com yourpassword

Leave the script running in a terminal session, and launch the provided web application in another one, passing it your Gmail account name in the INDEX environment variable:

$ INDEX=user@gmail.com ruby gmail-server.rb

You should see your own e-mail displayed at http://localhost:4567/. Make sure to check out all of the rich Lucene query syntax.
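To get a feel for the query syntax outside the web application, here is a minimal sketch that builds ElasticSearch URI search URLs for a few Lucene queries. It uses only the Ruby standard library. The `search_url` helper is hypothetical (it is not part of the gist), and the `subject` field name is an assumption about how the messages are indexed; only `from` is implied by this article.

```ruby
require 'uri'

# Hypothetical helper: builds a URI search URL for an index and a Lucene
# query string. Host, port and index name follow the examples in this article.
def search_url(index, query, host: 'localhost', port: 9200)
  params = URI.encode_www_form(q: query, pretty: true)
  "http://#{host}:#{port}/#{index}/_search?#{params}"
end

# A few Lucene query strings; `subject` is an assumed field name.
queries = [
  'from:github',                        # term in a specific field
  'subject:"pull request"',             # phrase query
  'from:github AND NOT subject:merged', # boolean operators
  'elastic*'                            # wildcard on the default field
]

queries.each { |q| puts search_url('user@gmail.com', q) }
```

You can paste the printed URLs into a browser or pass them to curl; the same query strings should also work in the search box of the web application.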

Of course, you’re not limited to search. With ElasticSearch facets, you can pull interesting stuff out of your data, such as getting statistics on who’s sending you the most e-mail:

$ curl -X POST "http://localhost:9200/user@gmail.com/message/_search?pretty=true" -d '
    {
      "facets" : {
        "senders" : { "terms" : { "field" : "from.exact" } }
      },
      "size" : 0
    }
  '

It’s definitely noreply@github.com in my case :) Your data is stored in the user@gmail.com index, which you can query directly at http://localhost:9200/user@gmail.com/_search?pretty=true&q=*.
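The facet request above can also be issued from Ruby. The following is a rough sketch using only the standard library rather than Tire. The helper names (`senders_facet_query`, `top_senders`, `fetch_senders`) are hypothetical, and the response parsing assumes the usual terms facet shape, where `facets.senders.terms` is a list of `term`/`count` pairs.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Request body equivalent to the curl example:
# a terms facet on `from.exact`, returning no hits.
def senders_facet_query
  {
    'facets' => { 'senders' => { 'terms' => { 'field' => 'from.exact' } } },
    'size'   => 0
  }
end

# Turns a parsed facet response into [sender, count] pairs.
# Assumes the response contains facets.senders.terms; returns [] otherwise.
def top_senders(response)
  terms = response.fetch('facets', {}).fetch('senders', {}).fetch('terms', [])
  terms.map { |t| [t['term'], t['count']] }
end

# Posts the query to ElasticSearch and returns the parsed JSON response.
# Host and port are the defaults used throughout this article.
def fetch_senders(index, host: 'localhost', port: 9200)
  uri = URI("http://#{host}:#{port}/#{index}/message/_search")
  response = Net::HTTP.post(uri, JSON.generate(senders_facet_query),
                            'Content-Type' => 'application/json')
  JSON.parse(response.body)
end

# Usage, with ElasticSearch running and the import done:
#   top_senders(fetch_senders('user@gmail.com')).each do |sender, count|
#     puts "#{count}\t#{sender}"
#   end
```

Keeping the query builder and the response parser separate from the HTTP call makes it easy to try them on a saved response without a running ElasticSearch.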

The full source code is available below.
