Connecting to Elasticsearch and executing queries is generally considered easy. In fact, in many cases a simple curl query like this one is all you need:

curl http://localhost:9200/_search?q=mySearch

However, if you want something more than finding all documents that contain mySearch (as in the example above), you may get lost in the abundance of options and the different ways of doing things.

In this article we'll go through the following scenario for searching documents in Elasticsearch:

  • Filter a few keywords that must be present in the results;
  • Filter a few keywords that must NOT be present in the results;
  • Set a specific time range for the search;
  • Filter the output only to the fields in which we are interested.
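
For orientation, in Elasticsearch's native Query DSL this scenario maps roughly onto a bool query of the following shape (the field names message and timestamp, the keywords, and the dates here are placeholders, not part of the sample data we'll use later):

```json
{
  "_source": ["timestamp", "message"],
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "keyword-that-must-be-present" } },
        { "range": { "timestamp": { "gte": "2015-01-01T00:00:00", "lte": "2015-12-31T23:59:59" } } }
      ],
      "must_not": [
        { "match": { "message": "keyword-that-must-not-be-present" } }
      ]
    }
  }
}
```

The Java client builds this kind of request programmatically, which is what the rest of the article walks through.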

To accomplish the above we'll use Java and the official Elasticsearch RestHighLevelClient API.

Setting up the environment

The RestHighLevelClient is the currently recommended Java client for integrating Elasticsearch in an application. Official clients are available for other programming languages too, but in this article we'll use the Java one.

Let's start by creating a new Maven project.

mvn -B archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DarchetypeArtifactId=maven-archetype-quickstart -DgroupId=com.knowledgebasement.elasticsearch.example -DartifactId=elasticsearch

Go inside the newly created elasticsearch directory in your workspace and open the pom.xml file. There, add a dependency for the Elasticsearch RestHighLevelClient.

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.0.0</version>
</dependency>

Either resolve the dependencies manually (mvn dependency:resolve) or leave it to your IDE.

Next, ensure that Elasticsearch is installed and started. You can follow the official getting started guide for this purpose. At the time of writing this article, the latest Elasticsearch version is 7.

Loading the example data

To demonstrate how the code works and to test it, we'll need some sample data. The sample data should be big enough to allow us to test different scenarios and filtering options. For example, the sample Logstash logs are a good choice.

Create their indices (the logs have a different index for each day) like this:

curl -X PUT "localhost:9200/logstash-2015.05.18" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "geo": {
        "properties": {
          "coordinates": {
            "type": "geo_point"
          }
        }
      }
    }
  }
}
'

curl -X PUT "localhost:9200/logstash-2015.05.19" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "geo": {
        "properties": {
          "coordinates": {
            "type": "geo_point"
          }
        }
      }
    }
  }
}
'

curl -X PUT "localhost:9200/logstash-2015.05.20" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "geo": {
        "properties": {
          "coordinates": {
            "type": "geo_point"
          }
        }
      }
    }
  }
}
'

Then download the JSON file with the dataset and extract it:

curl -O https://download.elastic.co/demos/kibana/gettingstarted/7.x/logs.jsonl.gz && gunzip logs.jsonl.gz

Finally, load the data:

curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk?pretty' --data-binary @logs.jsonl

To ensure that you have the dataset loaded correctly, execute a simple query for the keyword brewster-shaw as follows:

curl 'http://localhost:9200/logstash-2015.05.20/_search?pretty&q=brewster-shaw'

Even though the keyword is relatively unique and the query targets a specific index, you will see a long output with 44 results.

From this point on our challenge will be to filter these results in a way that allows us to work easily with this otherwise bulky data.

Using the Java RestHighLevelClient

The following example class App demonstrates how to use the RestHighLevelClient. Ensure that you have resolved its dependency first, as explained above in Setting up the environment.

If you have followed the steps so far, the class should be created in elasticsearch/src/main/java/com/knowledgebasement/elasticsearch/example/App.java as follows:

package com.knowledgebasement.elasticsearch.example;

import java.io.IOException;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class App {

	final private static String[] FETCH_FIELDS = { "@timestamp", "@message" };

	final private static String MATCH_FIELD = "@message";
	final private static String[] MUST_MATCH = { "brewster-shaw", "AppleWebKit" };
	final private static String[] MUST_NOT_MATCH = { "21.211.33.63" };

	final private static String TIME_FIELD = "@timestamp";
	final private static String START_TIME = "2015-05-20T13:06:50";
	final private static String END_TIME = "2025-05-06T00:00:00";

	final private static String INDEX = "logstash-2015.05.20"; // accepts * as a wildcard, e.g. log*

	public static void main(String[] args) throws IOException {

		RestHighLevelClient client = new RestHighLevelClient(
				RestClient.builder(new HttpHost("localhost", 9200, "http")));
		SearchRequest searchRequest = new SearchRequest();

		SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

		BoolQueryBuilder qb = QueryBuilders.boolQuery();

		// every keyword in MUST_MATCH has to be present in the field
		for (String match : MUST_MATCH) {
			qb.must(QueryBuilders.matchQuery(MATCH_FIELD, match));
		}

		// no keyword in MUST_NOT_MATCH may appear in the field
		for (String notMatch : MUST_NOT_MATCH) {
			qb.mustNot(QueryBuilders.matchQuery(MATCH_FIELD, notMatch));
		}

		// restrict the hits to the configured time range
		qb.must(QueryBuilders.rangeQuery(TIME_FIELD).gte(START_TIME).lte(END_TIME));

		searchSourceBuilder.query(qb).fetchSource(FETCH_FIELDS, null);

		searchRequest.indices(INDEX);
		searchRequest.source(searchSourceBuilder);

		SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);

		if (searchResponse.getHits().getTotalHits().value > 0) {

			System.out.println(searchResponse.getHits().getTotalHits());
			
			for (SearchHit hit : searchResponse.getHits()) {
				System.out.println("Match: ");
				for (String fetchField : FETCH_FIELDS) {
					System.out.println(" - " + fetchField + " " + hit.getSourceAsMap().get(fetchField));
				}
			}
		} else {
			System.out.println("No results matching the criteria.");
		}

		client.close();

	}
}

The Java code above is pretty clear (and hopefully clean). Looking at the constants, you should note that:

  • Two fields @timestamp and @message are used both for querying and for the output.
    • A specific time range should be matched in @timestamp.
    • There are strings that must be present in @message and others that must not be.
  • Only one index is queried. You could specify more with a wildcard match.
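
With the constants above, the request body that the class sends to Elasticsearch is roughly equivalent to the following Query DSL (shown here only for illustration):

```json
{
  "_source": { "includes": ["@timestamp", "@message"] },
  "query": {
    "bool": {
      "must": [
        { "match": { "@message": "brewster-shaw" } },
        { "match": { "@message": "AppleWebKit" } },
        { "range": { "@timestamp": { "gte": "2015-05-20T13:06:50", "lte": "2025-05-06T00:00:00" } } }
      ],
      "must_not": [
        { "match": { "@message": "21.211.33.63" } }
      ]
    }
  }
}
```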

The query is composed using QueryBuilders. It's quite powerful and has many more options than the ones used here.

Once you run the above code, you should see the following output acknowledging two successful hits for our search:

2 hits
Match: 
 - @timestamp 2015-05-20T14:29:07.004Z
 - @message 97.73.86.149 - - [2015-05-20T14:29:07.004Z] "GET /uploads/brewster-shaw.jpg HTTP/1.1" 200 2317 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24"
Match: 
 - @timestamp 2015-05-20T16:15:34.664Z
 - @message 139.18.210.214 - - [2015-05-20T16:15:34.664Z] "GET /uploads/brewster-shaw.jpg HTTP/1.1" 200 8345 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24"

Extending it further

It's easy to extend the code further and here are a few ideas:

  • Alerting - if you run the above code periodically and it returns more than 0 results, you could process the output further to send yourself a notification by mail, Slack, etc. Similar functionality is also present in the ELK (Elasticsearch, Logstash, Kibana) stack, but it is a paid feature.
  • Aggregation of the results from many independent Elasticsearch clusters. The above code connects to localhost, but you could connect to other hosts too and aggregate the results at the end.

Thank you for reading this article and good luck using the RestHighLevelClient.