How to save Elasticsearch query results using paging in a bash script automation

This blog post is a short cheat sheet on how to save Elasticsearch query results with paging in a bash script automation. The automation downloads all query results into JSON files, page by page, as long as the paging returns results in the hits of the response JSON. The files are numbered 01.json to XX.json.

Here are the steps of the bash script automation:

  1. Define environment variables for access to your Elasticsearch server and index.
  2. Define your Elasticsearch query.
  3. Invoke a curl command to get the first result page.
  4. If the result contains a scroll_id, invoke a curl command against the scroll endpoint to download the next page of the full result.
  5. Repeat the download step as long as the response contains a scroll_id and hits (a quick way to inspect these two fields is shown below).
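
A quick way to see what the loop will check for is to look at these two fields in the first saved page with jq. This is a minimal sketch; 01.json is the file name produced by the automation below:

jq -r '._scroll_id' 01.json        # the id used to request the next page
jq '.hits.hits | length' 01.json   # number of documents on this page (0 means there is nothing left)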


The following code is an example template for the .env file used by the automation.

export ELASTIC_SEARCH_URL=YOUR_ELASTIC_SERVER
export ELASTIC_SEARCH_INDEX=YOUR_INDEX
export ELASTIC_SEARCH_USER=YOUR_USER
export ELASTIC_SEARCH_PASSWORD=YOUR_PASSWORD
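
After filling in your values, you can source the file and run a quick connection check before starting the automation. This is a minimal sketch; it assumes that ELASTIC_SEARCH_URL ends with a trailing slash, because the script below concatenates the URL and the index name directly:

source ./.env
# The cluster root endpoint answers with basic information, including the version number.
curl -s -u "$ELASTIC_SEARCH_USER:$ELASTIC_SEARCH_PASSWORD" "$ELASTIC_SEARCH_URL" | jq '.version.number'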

The following code is an example bash automation with the following sections:

  1. Query data from an index and save the first result.
  2. Identify scroll_id/hits and save the data in the next result.
  3. Loop pages until there is no scroll id or there are no hits available.

#!/bin/bash

source ./.env

export PAGE="01.json"

echo "***************************"
echo "1. Query data from an index and save the first result."

curl -X POST \
  -u $ELASTIC_SEARCH_USER:$ELASTIC_SEARCH_PASSWORD \
  "$ELASTIC_SEARCH_URL$ELASTIC_SEARCH_INDEX/_search?scroll=50m" \
  -H "Content-Type: application/json" \
  -d ' {
  "size" : 1000,
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "should":[
        { "term": {
            "prod_Id": "YOUR_FIRT_PRODUCT_IC"
          }
        },
        { "term": {
            "prod_Id": "YOU_SECOND_PRODUCT_ID"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "vers_Id": "YOUR_VERSION"
          }
        },
        {
          "term": {
            "status": "YOUR_STATUS"
          }
        },
        {
          "term": {
            "language": "english"
          }
        }
      ]
    }
  }
}
' | jq '.' > $PAGE

echo "***************************"
echo "2. Identify scroll_id/hits and save the data in the next result."

# Use '// empty' so SCROLL_ID stays empty (instead of the string "null") when the field is missing.
SCROLL_ID=$(jq -c '._scroll_id // empty' "$PAGE")
HITS=$(jq -c '.hits.hits[]' "$PAGE")

i=2
export PAGE=02.json

echo "***************************"
echo "3. Loop pages until there is no scroll id or there are no hits available."

if [[ -z ${SCROLL_ID} ]] || [[ -z ${HITS} ]]; then
   echo "EXIT script: No 'scroll_id' or 'hits' are given."
   exit 1
fi

while : 
do
  echo "Download page $PAGE"
  curl -X POST \
   -u $ELASTIC_SEARCH_USER:$ELASTIC_SEARCH_PASSWORD \
   "${ELASTIC_SEARCH_URL}_search/scroll" \
   -H "Content-Type: application/json" \
   -d "{ \"scroll\" : \"50m\", \"scroll_id\" : $SCROLL_ID}" | jq '.' > $PAGE

  ((i=i+1))

  SCROLL_ID=$(jq -c '._scroll_id // empty' "$PAGE")
  HITS=$(jq -c '.hits.hits[]' "$PAGE")
  #echo "--- hits for page ${i} - BEGIN---"
  #echo "${HITS}"
  #echo "--- hits for page ${i} - END---"
  #echo "--- scroll_id for page ${i} - BEGIN---"
  #echo "${SCROLL_ID}"
  #echo "--- scroll_id for page ${i} - END---"
  
  if [[ -z ${SCROLL_ID} ]] || [[ -z ${HITS} ]]; then
    rm $PAGE
    echo "------END-----"
    break
  else
    echo "--------------"
  fi

  if ((i<10));then
    export PAGE="0${i}.json"
  else
    export PAGE="${i}.json"
  fi
done

((i=i-1))
echo "Result: ${i} pages were downloaded."

I hope this was useful to you. Let's see what's next!

Greetings,

Thomas

#elasticsearch, #bashscripting, #cheatsheet, #development, #automation
