Elasticsearch: Exploring Your Data

penn@ubuntu:~$ wget https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json
penn@ubuntu:~$ wc -l accounts.json
2000 accounts.json
penn@ubuntu:~$ curl -XPOST '127.0.0.1:9200/bank/account/_bulk?pretty&refresh' --data-binary "@accounts.json"
penn@ubuntu:~$ curl '127.0.0.1:9200/_cat/indices?v'
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  zFGyqA6oSBq1VxpW63ssMQ   5   1       1000            0      656kb          656kb
  1. The Search API

    There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?q=*&sort=account_number:asc&pretty'
    {
      "took" : 96, // time in milliseconds for Elasticsearch to execute the search
      "timed_out" : false, // tells us if the search timed out or not
      "_shards" : { // tells us how many shards were searched, as well as a count of the successful/failed searched shards
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : { // search results
        "total" : 1000, // total number of documents matching our search criteria
        "max_score" : null, // ignore these fields for now
        "hits" : [ // actual array of search results (defaults to first 10 documents)
          {
            "_index" : "bank",
            "_type" : "account",
            "_id" : "0",
            "_score" : null, // ignore these fields for now
            "_source" : {
              "account_number" : 0,
              "balance" : 16623,
              "firstname" : "Bradshaw",
              "lastname" : "Mckenzie",
              "age" : 29,
              "gender" : "F",
              "address" : "244 Columbus Place",
              "employer" : "Euron",
              "email" : "bradshawmckenzie@euron.com",
              "city" : "Hobucken",
              "state" : "CO"
            },
            "sort" : [ // sort key for results (missing if sorting by score)
              0
            ]
          },
    Here is the same search using the request body method:
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match_all": {} },
    >   "sort": [
    >     { "account_number": "asc" }
    >   ]
    > }'
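    To make the two request styles concrete, the sketch below builds both forms of the search in Python without contacting a cluster. The helper names (uri_search_url, body_search) and the http:// scheme in BASE are assumptions for illustration; only the host, index, and parameters come from the commands above.

```python
import json
from urllib.parse import urlencode

# Host and index are taken from the curl examples; the http:// scheme is assumed.
BASE = "http://127.0.0.1:9200/bank/_search"

def uri_search_url(query, sort):
    """Request-URI style: every search parameter travels in the query string."""
    # urlencode percent-encodes "*" and ":" so they are safe inside a URL.
    return BASE + "?" + urlencode({"q": query, "sort": sort})

def body_search(sort_field, order):
    """Request-body style: the URL stays fixed and the search is a JSON document."""
    body = {"query": {"match_all": {}}, "sort": [{sort_field: order}]}
    return BASE, json.dumps(body)

url = uri_search_url("*", "account_number:asc")
body_url, body = body_search("account_number", "asc")
print(url)
print(body_url, body)
```

    The request body form is more verbose for trivial searches, but it is the one that can carry the nested queries and aggregations used in the rest of these notes.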
  2. Introducing the Query Language

    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match_all": {} }
    > }'
    The query part tells us what our query definition is and the match_all part is simply the type of query that we want to run.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match_all": {} },
    >   "size": 1
    > }'
    If size is not specified, it defaults to 10.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match_all": {} },
    >   "from": 10,
    >   "size": 10
    > }'
    This example does a match_all and returns documents 11 through 20.
    The from parameter (0-based) specifies the offset of the first document to return, and the size parameter specifies how many documents to return starting at that offset.
    If from is not specified, it defaults to 0.
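    The from/size arithmetic can be checked with a small local model. This is only a sketch of the paging semantics described above, using a plain list as a stand-in for the ranked hits; paginate is a hypothetical helper, not an Elasticsearch API.

```python
def paginate(hits, from_=0, size=10):
    """Return the slice of hits selected by a 0-based from offset and a size."""
    return hits[from_:from_ + size]

# Stand-in for 1000 ranked documents, numbered from 1 for readability.
ranked = list(range(1, 1001))

page = paginate(ranked, from_=10, size=10)
print(page[0], page[-1])  # first and last document on the page
```

    With from set to 10 and size set to 10, the slice starts at the eleventh document and ends at the twentieth, which is why the example above returns documents 11 through 20.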
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match_all": {} },
    >   "sort": { "balance": { "order": "desc" } }
    > }'
    This example does a match_all and sorts the results by account balance in descending order and returns the top 10 (default size) documents.
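    As a rough local equivalent of "sort by balance descending, keep the default 10", the sketch below sorts a handful of made-up accounts in Python. The account data and the top_by_balance helper are illustrative assumptions.

```python
def top_by_balance(docs, size=10):
    """Sort by balance, highest first, and keep the first `size` documents."""
    return sorted(docs, key=lambda doc: doc["balance"], reverse=True)[:size]

# Twelve made-up accounts with balances 1000, 2000, ..., 12000.
accounts = [{"account_number": n, "balance": n * 1000} for n in range(1, 13)]

top = top_by_balance(accounts)
print([doc["balance"] for doc in top])
```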
  3. Executing Searches

    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match_all": {} },
    >   "_source": ["account_number", "balance"]
    > }'
    Note that the above example simply reduces the _source field.
    It will still only return one field named _source but within it, only the fields account_number and balance are included.
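    The effect of the _source list on a single hit can be pictured as follows. This is a simplified sketch: filter_source is a hypothetical helper, and the hit is a shortened copy of the first result shown earlier.

```python
def filter_source(hit, fields):
    """Copy a hit, keeping only the requested keys inside its _source."""
    trimmed = dict(hit)
    trimmed["_source"] = {k: v for k, v in hit["_source"].items() if k in fields}
    return trimmed

hit = {
    "_index": "bank",
    "_id": "0",
    "_source": {"account_number": 0, "balance": 16623,
                "firstname": "Bradshaw", "city": "Hobucken"},
}

result = filter_source(hit, ["account_number", "balance"])
print(result)
```

    The hit keeps its metadata (_index, _id) and still has exactly one _source field; only the contents of _source shrink.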
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match": { "account_number": 20 } }
    > }'
    This example returns the account numbered 20.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match": { "address": "mill" } }
    > }'
    This example returns all accounts containing the term "mill" in the address.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match": { "address": "mill lane" } }
    > }'
    This example returns all accounts containing the term "mill" or "lane" in the address.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": { "match_phrase": { "address": "mill lane" } }
    > }'
    This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address.
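    The difference between match and match_phrase on a multi-word query can be illustrated with a toy model. The sketch below assumes a crude analyzer (lowercase, split on whitespace) and ignores scoring entirely; the function names and the sample addresses are made up for illustration and are not how Elasticsearch is implemented.

```python
def tokens(text):
    """Lowercase and split on whitespace; a crude stand-in for an analyzer."""
    return text.lower().split()

def match(field_value, query):
    """True if the field contains ANY of the query terms (term1 OR term2)."""
    field_terms = set(tokens(field_value))
    return any(term in field_terms for term in tokens(query))

def match_phrase(field_value, query):
    """True if the query terms appear adjacent and in order in the field."""
    field_terms, phrase = tokens(field_value), tokens(query)
    n = len(phrase)
    return any(field_terms[i:i + n] == phrase
               for i in range(len(field_terms) - n + 1))

# Made-up addresses for illustration.
addresses = ["198 Mill Lane", "990 Mill Road", "263 Aviation Road", "715 Lane Mill Court"]

print([a for a in addresses if match(a, "mill lane")])
print([a for a in addresses if match_phrase(a, "mill lane")])
```

    In this model, match accepts any address containing either word, while match_phrase accepts only the address where "mill" is immediately followed by "lane".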
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": {
    >     "bool": {
    >       "must": [
    >         { "match": { "address": "mill" } },
    >         { "match": { "address": "lane" } }
    >       ]
    >     }
    >   }
    > }'
    In the above example, the bool must clause specifies all the queries that must be true for a document to be considered a match.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": {
    >     "bool": {
    >       "should": [
    >         { "match": { "address": "mill" } },
    >         { "match": { "address": "lane" } }
    >       ]
    >     }
    >   }
    > }'
    In the above example, the bool should clause specifies a list of queries, at least one of which must be true for a document to be considered a match.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": {
    >     "bool": {
    >       "must_not": [
    >         { "match": { "address": "mill" } },
    >         { "match": { "address": "lane" } }
    >       ]
    >     }
    >   }
    > }'
    In the above example, the bool must_not clause specifies a list of queries, none of which may be true for a document to be considered a match.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": {
    >     "bool": {
    >       "must": [
    >         { "match": { "age": "40" } }
    >       ],
    >       "must_not": [
    >         { "match": { "state": "ID" } }
    >       ]
    >     }
    >   }
    > }'
    This example returns all accounts of anybody who is 40 years old but does not live in Idaho (ID).
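    The way must, should, and must_not combine can be sketched as a boolean filter over documents. This is a simplified model that ignores scoring: each clause is a plain Python predicate, and when there are no must clauses at least one should clause has to match. The bool_query helper and the three accounts are hypothetical.

```python
def bool_query(doc, must=(), should=(), must_not=()):
    """Evaluate a simplified bool query; each clause is a predicate on doc."""
    if any(clause(doc) for clause in must_not):
        return False  # any must_not hit excludes the document
    if not all(clause(doc) for clause in must):
        return False  # every must clause has to hold
    if should and not must:
        # With no must clauses, at least one should clause has to match.
        return any(clause(doc) for clause in should)
    return True

def is_40(doc):
    return doc["age"] == 40

def in_idaho(doc):
    return doc["state"] == "ID"

# Made-up accounts for illustration.
accounts = [
    {"account_number": 1, "age": 40, "state": "ID"},
    {"account_number": 2, "age": 40, "state": "CO"},
    {"account_number": 3, "age": 29, "state": "CO"},
]

matches = [a["account_number"] for a in accounts
           if bool_query(a, must=[is_40], must_not=[in_idaho])]
print(matches)
```

    Only account 2 survives: account 1 is excluded by must_not and account 3 fails the must clause.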
  4. Executing Filters

    Earlier we skipped the _score field in the search results. The score is a numeric value that is a relative measure of how well the document matches the search query that we specified.
    The higher the score, the more relevant the document; the lower the score, the less relevant the document.

    The bool query that we introduced in the previous section also supports filter clauses, which let us use a query to restrict the documents matched by the other clauses without changing how scores are computed.
    As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.

    This example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive.
    In other words, we want to find accounts with a balance that is greater than or equal to 20000 and less than or equal to 30000.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "query": {
    >     "bool": {
    >       "must": { "match_all": {} },
    >       "filter": {
    >         "range": {
    >           "balance": {
    >             "gte": 20000,
    >             "lte": 30000
    >           }
    >         }
    >       }
    >     }
    >   }
    > }'
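    The inclusive bounds of the range filter above can be checked with a small local model. The sketch treats the filter as a pure include/exclude test with no effect on scores; range_filter and the sample balances (other than 16623, which appears in the first search result) are assumptions for illustration.

```python
def range_filter(docs, field, gte=None, lte=None):
    """Keep docs whose field lies in [gte, lte]; either bound may be omitted."""
    kept = []
    for doc in docs:
        value = doc[field]
        if gte is not None and value < gte:
            continue
        if lte is not None and value > lte:
            continue
        kept.append(doc)
    return kept

accounts = [{"balance": b} for b in (16623, 20000, 25000, 30000, 30001)]

in_range = range_filter(accounts, "balance", gte=20000, lte=30000)
print([doc["balance"] for doc in in_range])
```

    Both endpoints are kept because gte and lte are inclusive; 30001 falls just outside.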
  5. Executing Aggregations

    Aggregations provide the ability to group and extract statistics from your data.
    In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "size": 0,
    >   "aggs": {
    >     "group_by_state": {
    >       "terms": {
    >         "field": "state.keyword"
    >       }
    >     }
    >   }
    > }'
    This example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending.
    Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.
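    Conceptually, this terms aggregation is a group-by-and-count. The sketch below reproduces that idea locally with collections.Counter; terms_agg is a hypothetical helper, the sample states are made up, and breaking ties by key is a choice made here only to keep the output deterministic.

```python
from collections import Counter

def terms_agg(docs, field, size=10):
    """Count documents per distinct field value and keep the `size` largest buckets."""
    counts = Counter(doc[field] for doc in docs)
    # Sort by count descending; ties are broken by key so the order is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]

# Made-up accounts for illustration.
accounts = [{"state": s} for s in ["ID", "TX", "ID", "AL", "TX", "ID"]]

buckets = terms_agg(accounts, "state")
print(buckets)
```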
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "size": 0,
    >   "aggs": {
    >     "group_by_state": {
    >       "terms": {
    >         "field": "state.keyword"
    >       },
    >       "aggs": {
    >         "average_balance": {
    >           "avg": {
    >             "field": "balance"
    >           }
    >         }
    >       }
    >     }
    >   }
    > }'
    Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all the aggregations.
    You can nest aggregations inside aggregations arbitrarily to extract pivoted summarizations that you require from your data.
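    Nesting an avg aggregation inside a terms aggregation amounts to computing one average per bucket. A minimal local sketch of that idea, with a hypothetical helper name and made-up accounts:

```python
from collections import defaultdict

def avg_balance_by_state(docs):
    """Group balances by state (outer bucket), then average each group (inner metric)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["state"]].append(doc["balance"])
    return {state: sum(balances) / len(balances) for state, balances in groups.items()}

# Made-up accounts for illustration.
accounts = [
    {"state": "ID", "balance": 100},
    {"state": "ID", "balance": 300},
    {"state": "TX", "balance": 50},
]

averages = avg_balance_by_state(accounts)
print(averages)
```

    The outer loop plays the role of group_by_state and the final average plays the role of average_balance; adding another level of nesting would mean grouping by a second key before averaging.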
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "size": 0,
    >   "aggs": {
    >     "group_by_state": {
    >       "terms": {
    >         "field": "state.keyword",
    >         "order": {
    >           "average_balance": "desc"
    >         }
    >       },
    >       "aggs": {
    >         "average_balance": {
    >           "avg": {
    >             "field": "balance"
    >           }
    >         }
    >       }
    >     }
    >   }
    > }'
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    >   "size": 0,
    >   "aggs": {
    >     "group_by_age": {
    >       "range": {
    >         "field": "age",
    >         "ranges": [
    >           {
    >             "from": 20,
    >             "to": 30
    >           },
    >           {
    >             "from": 30,
    >             "to": 40
    >           },
    >           {
    >             "from": 40,
    >             "to": 50
    >           }
    >         ]
    >       },
    >       "aggs": {
    >         "group_by_gender": {
    >           "terms": {
    >             "field": "gender.keyword"
    >           },
    >           "aggs": {
    >             "average_balance": {
    >               "avg": {
    >                 "field": "balance"
    >               }
    >             }
    >           }
    >         }
    >       }
    >     }
    >   }
    > }'
    This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance, per age bracket, per gender.
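    The three-level grouping can be modelled locally as well. In the sketch below each range includes its "from" value and excludes its "to" value, which is why 20 to 30 corresponds to ages 20-29. The helper names and the sample accounts are assumptions for illustration.

```python
from collections import defaultdict

# Same brackets as the query: "from" is inclusive, "to" is exclusive.
AGE_RANGES = [(20, 30), (30, 40), (40, 50)]

def bucket_label(age):
    """Return a label such as '20-29' for the bracket containing age, or None."""
    for low, high in AGE_RANGES:
        if low <= age < high:
            return f"{low}-{high - 1}"
    return None

def avg_balance_by_age_and_gender(docs):
    """Bucket by age bracket, then by gender, then average the balances."""
    balances = defaultdict(list)
    for doc in docs:
        label = bucket_label(doc["age"])
        if label is not None:  # ages outside every bracket are left out
            balances[(label, doc["gender"])].append(doc["balance"])
    return {key: sum(vals) / len(vals) for key, vals in balances.items()}

# Made-up accounts for illustration.
accounts = [
    {"age": 29, "gender": "F", "balance": 100},
    {"age": 30, "gender": "F", "balance": 200},
    {"age": 39, "gender": "M", "balance": 400},
    {"age": 30, "gender": "M", "balance": 600},
    {"age": 55, "gender": "F", "balance": 999},
]

result = avg_balance_by_age_and_gender(accounts)
print(result)
```

    The account aged 55 falls outside every bracket and does not contribute to any bucket, and the account aged 30 lands in 30-39 rather than 20-29.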