elasticsearch exploring your data

exploring your data

penn@ubuntu:~$ wget https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json
penn@ubuntu:~$ wc -l accounts.json
2000 accounts.json
penn@ubuntu:~$ curl -XPOST '127.0.0.1:9200/bank/account/_bulk?pretty&refresh' --data-binary "@accounts.json"
penn@ubuntu:~$ curl '127.0.0.1:9200/_cat/indices?v'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open bank zFGyqA6oSBq1VxpW63ssMQ 5 1 1000 0 656kb 656kb
  1. The Search API

    There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?q=*&sort=account_number:asc&pretty'
    {
    "took" : 96, //time in milliseconds for Elasticsearch to execute the search
    "timed_out" : false, //tells us if the search timed out or not
    "_shards" : { // tells us how many shards were searched, as well as a count of the successful/failed searched shards
    "total" : 5,
    "successful" : 5,
    "failed" : 0
    },
    "hits" : { //search results
    "total" : 1000, //total number of documents matching our search criteria
    "max_score" : null, //ignore these fields for now
    "hits" : [ //actual array of search results (defaults to first 10 documents)
    {
    "_index" : "bank",
    "_type" : "account",
    "_id" : "0",
    "_score" : null, //ignore these fields for now
    "_source" : {
    "account_number" : 0,
    "balance" : 16623,
    "firstname" : "Bradshaw",
    "lastname" : "Mckenzie",
    "age" : 29,
    "gender" : "F",
    "address" : "244 Columbus Place",
    "employer" : "Euron",
    "email" : "bradshawmckenzie@euron.com",
    "city" : "Hobucken",
    "state" : "CO"
    },
    "sort" : [ //sort key for results (missing if sorting by score)
    0
    ]
    },
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match_all": {} },
    > "sort": [
    > { "account_number": "asc" }
    > ]
    > }'
  2. Introducing the Query Language

    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match_all": {} }
    > }'
    The query part tells us what our query definition is and the match_all part is simply the type of query that we want to run.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match_all": {} },
    > "size": 1
    > }'
    if size is not specified, it defaults to 10.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match_all": {} },
    > "from": 10,
    > "size": 10
    > }'
    This example does a match_all and returns documents 11 through 20.
    The from parameter (0-based) specifies which document index to start from and the size parameter specifies how many documents to return starting at the from parameter.
    If from is not specified, it defaults to 0.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match_all": {} },
    > "sort": { "balance": { "order": "desc" } }
    > }'
    This example does a match_all and sorts the results by account balance in descending order and returns the top 10 (default size) documents.
  3. Executing Searches

    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match_all": {} },
    > "_source": ["account_number", "balance"]
    > }'
    Note that the above example simply reduces the _source field.
    It will still only return one field named _source but within it, only the fields account_number and balance are included.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match": { "account_number": 20 } }
    > }'
    This example returns the account numbered 20.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match": { "address": "mill" } }
    > }'
    This example returns all accounts containing the term "mill" in the address.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match": { "address": "mill lane" } }
    > }'
    This example returns all accounts containing the term "mill" or "lane" in the address.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": { "match_phrase": { "address": "mill lane" } }
    > }'
    This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": {
    > "bool": {
    > "must": [
    > { "match": { "address": "mill" } },
    > { "match": { "address": "lane" } }
    > ]
    > }
    > }
    > }'
    In the above example, the bool must clause specifies all the queries that must be true for a document to be considered a match.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": {
    > "bool": {
    > "should": [
    > { "match": { "address": "mill" } },
    > { "match": { "address": "lane" } }
    > ]
    > }
    > }
    > }'
    In the above example, the bool should clause specifies a list of queries either of which must be true for a document to be considered a match.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": {
    > "bool": {
    > "must_not": [
    > { "match": { "address": "mill" } },
    > { "match": { "address": "lane" } }
    > ]
    > }
    > }
    > }'
    In the above example, the bool must_not clause specifies a list of queries none of which must be true for a document to be considered a match.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": {
    > "bool": {
    > "must": [
    > { "match": { "age": "40" } }
    > ],
    > "must_not": [
    > { "match": { "state": "ID" } }
    > ]
    > }
    > }
    > }'
    This example returns all accounts of anybody who is 40 years old but does not live in ID (Idaho).
  4. Executing Filters

    The score is a numeric value that is a relative measure of how well the document matches the search query that we specified.
    The higher the score, the more relevant the document is, the lower the score, the less relevant the document is.

    The bool query that we introduced in the previous section also supports filter clauses, which allow you to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed.
    As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.

    This example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive.
    In other words, we want to find accounts with a balance that is greater than or equal to 20000 and less than or equal to 30000.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "query": {
    > "bool": {
    > "must": { "match_all": {} },
    > "filter": {
    > "range": {
    > "balance": {
    > "gte": 20000,
    > "lte": 30000
    > }
    > }
    > }
    > }
    > }
    > }'
  5. Executing Aggregations

    Aggregations provide the ability to group and extract statistics from your data.
    In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "size": 0,
    > "aggs": {
    > "group_by_state": {
    > "terms": {
    > "field": "state.keyword"
    > }
    > }
    > }
    > }'
    This example groups all the accounts by state, and then returns the top 10 (default) states sorted by count in descending order.
    Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "size": 0,
    > "aggs": {
    > "group_by_state": {
    > "terms": {
    > "field": "state.keyword"
    > },
    > "aggs": {
    > "average_balance": {
    > "avg": {
    > "field": "balance"
    > }
    > }
    > }
    > }
    > }
    > }'
    Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all the aggregations.
    You can nest aggregations inside aggregations arbitrarily to extract pivoted summarizations that you require from your data.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "size": 0,
    > "aggs": {
    > "group_by_state": {
    > "terms": {
    > "field": "state.keyword",
    > "order": {
    > "average_balance": "desc"
    > }
    > },
    > "aggs": {
    > "average_balance": {
    > "avg": {
    > "field": "balance"
    > }
    > }
    > }
    > }
    > }
    > }'
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/bank/_search?pretty' -d'
    > {
    > "size": 0,
    > "aggs": {
    > "group_by_age": {
    > "range": {
    > "field": "age",
    > "ranges": [
    > {
    > "from": 20,
    > "to": 30
    > },
    > {
    > "from": 30,
    > "to": 40
    > },
    > {
    > "from": 40,
    > "to": 50
    > }
    > ]
    > },
    > "aggs": {
    > "group_by_gender": {
    > "terms": {
    > "field": "gender.keyword"
    > },
    > "aggs": {
    > "average_balance": {
    > "avg": {
    > "field": "balance"
    > }
    > }
    > }
    > }
    > }
    > }
    > }
    > }'
    This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance, per age bracket, per gender.

elasticsearch modifying your data

  1. Creating and Updating a Document

    penn@ubuntu:~$ curl -XPUT '127.0.0.1:9200/customer/external/1?pretty&pretty' -d '{ "name": "penn" }'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 1,
    "result" : "created",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : true
    }
    penn@ubuntu:~$ curl -XPUT '127.0.0.1:9200/customer/external/1?pretty&pretty' -d '{ "name": "peng" }'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 2, //version +1
    "result" : "updated", //数据覆盖更新
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : false
    }
    penn@ubuntu:~$ curl -XPOST 'localhost:9200/customer/external?pretty&pretty' -d '{ "name": "penn" }'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "AVhU7yUoaCtZxg-qM7ht", //不指定ID,ES随机分配一个ID
    "_version" : 1,
    "result" : "created",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : true
    }
  2. Updating Documents

    Whenever we do an update, Elasticsearch deletes the old document and then indexes a new document with the update applied to it in one shot.
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/customer/external/1?pretty'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 3, //注意version变化
    "found" : true,
    "_source" : {
    "name" : "penn" //内容
    }
    }
    penn@ubuntu:~$ curl -XPOST '127.0.0.1:9200/customer/external/1/_update?pretty&pretty' -d '{ "doc": { "name": "peng" } }'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 4, //version + 1
    "result" : "updated", //操作状态
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    }
    }
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/customer/external/1?pretty'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 4,
    "found" : true,
    "_source" : {
    "name" : "peng" //内容更改
    }
    }
    penn@ubuntu:~$ curl -XPOST '127.0.0.1:9200/customer/external/1/_update?pretty&pretty' -d '{ "doc": { "name": "peng", "age": 27 } }'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 5,
    "result" : "updated",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    }
    }
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/customer/external/1?pretty'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 5,
    "found" : true,
    "_source" : {
    "name" : "peng",
    "age" : 27
    }
    }
    penn@ubuntu:~$ curl -XPOST '127.0.0.1:9200/customer/external/1/_update?pretty&pretty' -d '{ "script" : "ctx._source.age += 5" }'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 6,
    "result" : "updated",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    }
    }
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/customer/external/1?pretty'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 6,
    "found" : true,
    "_source" : {
    "name" : "peng",
    "age" : 32
    }
    }
  3. Deleting Documents

    penn@ubuntu:~$ curl -XDELETE '127.0.0.1:9200/customer/external/2?pretty'
    {
    "found" : false,
    "_index" : "customer",
    "_type" : "external",
    "_id" : "2",
    "_version" : 1,
    "result" : "not_found",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    }
    }
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/customer/external/2?pretty'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "2",
    "found" : false
    }
  4. Batch Processing

    The bulk API executes all the actions sequentially and in order.
    If a single action fails for whatever reason, it will continue to process the remainder of the actions after it.
    penn@ubuntu:~$ curl -XPOST '127.0.0.1:9200/customer/external/_bulk?pretty&pretty' -d'
    > {"index":{"_id":"1"}}
    > {"name": "penn" }
    > {"index":{"_id":"2"}}
    > {"name": "penn" }'
    {
    "took" : 186,
    "errors" : false,
    "items" : [
    {
    "index" : {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 9,
    "result" : "created",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : true,
    "status" : 201
    }
    }
    ]
    }
    penn@ubuntu:~$ curl -XPOST '127.0.0.1:9200/customer/external/_bulk?pretty&pretty' -d'
    > {"update":{"_id":"1"}}
    > {"doc": { "name": "John Doe becomes Jane Doe" } }
    > {"delete":{"_id":"2"}}'
    {
    "took" : 61,
    "errors" : false,
    "items" : [
    {
    "update" : {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 10,
    "result" : "updated",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "status" : 200
    }
    }
    ]
    }

elasticsearch exploring your cluster

  1. cluster health

    penn@ubuntu:~$ curl -XGET http://127.0.0.1:9200/_cat/health?v
    epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1478891791 03:16:31 escluster green 1 1 0 0 0 0 0 0 - 100.0%

    cluster: the name of the cluster
    status: green means the cluster is fully healthy; yellow means all data (primary shards) is available but some replicas are not allocated; red means some primary shards are unallocated
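    When scripting maintenance work it is often useful to block until the cluster reaches a given status. A minimal sketch using the standard wait_for_status and timeout parameters of the cluster health API (the call returns as soon as the status is reached, or when the timeout expires):
    curl -XGET 'http://127.0.0.1:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty'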
  2. nodes check

    penn@ubuntu:~$ curl -XGET http://127.0.0.1:9200/_cat/nodes?v
    ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
    127.0.0.1 6 98 0 0.00 0.00 0.00 mdi * esnode
  3. List All Indices

    penn@ubuntu:~$ curl -XGET http://127.0.0.1:9200/_cat/indices?v
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
  4. Create an index

    penn@ubuntu:~$ curl -XPUT http://127.0.0.1:9200/customer?pretty
    {
    "acknowledged" : true,
    "shards_acknowledged" : true
    }

    penn@ubuntu:~$ curl -XGET http://127.0.0.1:9200/_cat/indices?v
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
    yellow open customer QRuRC4l0QBGJlX3BWGI1yQ 5 1 0 0 260b 260b
  5. Index and Query a Document

    penn@ubuntu:~$ curl -XPUT '127.0.0.1:9200/customer/external/1?pretty&pretty' -d'
    > {
    > "name": "penn"
    > }'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 1,
    "result" : "created",
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : true
    }
    penn@ubuntu:~$ curl -XGET '127.0.0.1:9200/customer/external/1?pretty'
    {
    "_index" : "customer",
    "_type" : "external",
    "_id" : "1",
    "_version" : 1,
    "found" : true,
    "_source" : {
    "name" : "penn"
    }
    }
  6. Delete an index

    penn@ubuntu:~$ curl -XGET http://127.0.0.1:9200/_cat/indices?v
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
    yellow open customer QRuRC4l0QBGJlX3BWGI1yQ 5 1 1 0 3.7kb 3.7kb

    penn@ubuntu:~$ curl -XDELETE http://127.0.0.1:9200/customer?pretty
    {
    "acknowledged" : true
    }

    penn@ubuntu:~$ curl -XGET http://127.0.0.1:9200/_cat/indices?v
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

elasticsearch safe restart

Safely restarting an Elasticsearch node

  1. Flush to disk

    [root@ip-172-31-90-193 ~]# curl -XGET http://172.31.90.193:9200/_flush
    {"_shards":{"total":152,"successful":152,"failed":0}}
  2. Disable automatic shard rebalancing first

    [root@ip-172-31-90-193 ~]# curl -XPUT http://172.31.90.193:9200/_cluster/settings -d '
    > {
    > "transient" : {
    > "cluster.routing.allocation.enable" : "none"
    > }
    > }'
    {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"none"}}}}}
  3. Stop the node

    [root@ip-172-31-90-193 ~]# curl -XGET http://172.31.90.193:9200/_cat/nodes?v
    host ip heap.percent ram.percent load node.role master name
    172.31.90.193 172.31.90.193 34 99 2.07 d * Molten Man
    172.31.90.45 172.31.90.45 43 99 0.75 d m node-2

    [root@ip-172-31-90-193 ~]# curl -XPOST http://192.168.1.3:9200/_cluster/nodes/_local/_shutdown
  4. Start the node

  5. Re-enable automatic shard rebalancing

    [root@ip-172-31-90-193 ~]# curl -XPUT http://172.31.90.193:9200/_cluster/settings -d'
    > {
    > "transient" : {
    > "cluster.routing.allocation.enable" : "all"
    > }
    > }'
    {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}

elasticsearch migrate shard

Migrating shards

elasticsearch lets you assign index shards manually through the reroute API. To do it fully by hand, you must first set the "cluster.routing.allocation.disable_allocation" parameter to true so that ES stops allocating shards automatically; otherwise, as soon as you move a shard from one node to another, the cluster may immediately move another shard back onto the node you just freed.

There are three operations: move, cancel, and allocate.
They are described below:
1. move
Moves a shard from one node to another. You specify the index name and the shard number.
2. cancel
Cancels the allocation of a shard. You specify the index name and the shard number; the node parameter selects the node on which to cancel the in-progress allocation, and allow_primary also permits cancelling the allocation of a primary shard.
3. allocate
Allocates an unassigned shard to a given node. You specify the index name and the shard number; the node parameter selects the target node, and allow_primary can force allocation of a primary shard, which may lead to data loss.
4. For example:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
"commands" : [ {
"move" :
{
"index" : "test", "shard" : 0,
"from_node" : "node1", "to_node" : "node2"
}
},
{
"cancel" :
{
"index" : "test", "shard" : 0, "node" : "node1"
}
},
{
"allocate" : {
"index" : "test", "shard" : 1, "node" : "node3"
}
}
]
}'
//move a shard from one node to another
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
"commands" : [ {
"move" :
{
"index" : "test", "shard" : 0,
"from_node" : "node1", "to_node" : "node2"
}
}
]
}'

Notes on shards

[wisdom@10 ~]$ curl -XGET '10.0.3.41:9200/_cat/shards'|grep v2-inbound-request-2017.03
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 26082 100 26082 0 0 61893 0 --:--:-- --:--:-- --:--:-- 61952
v2-inbound-request-2017.03 1 r STARTED 25071632 35.2gb 10.0.3.41 10.0.3.41
v2-inbound-request-2017.03 1 p STARTED 25071643 35.2gb 10.0.3.40 10.0.3.40
v2-inbound-request-2017.03 3 p STARTED 25107804 35.3gb 10.0.3.41 10.0.3.41
v2-inbound-request-2017.03 3 r STARTED 25107796 35.3gb 10.0.3.42 10.0.3.42
v2-inbound-request-2017.03 2 r STARTED 25098807 35.2gb 10.0.3.42 10.0.3.42
v2-inbound-request-2017.03 2 p STARTED 25098799 35.2gb 10.0.3.40 10.0.3.40
v2-inbound-request-2017.03 4 p STARTED 25108295 35.3gb 10.0.3.41 10.0.3.41
v2-inbound-request-2017.03 4 r STARTED 25108301 35.3gb 10.0.3.42 10.0.3.42
v2-inbound-request-2017.03 0 r STARTED 25105055 35.3gb 10.0.3.42 10.0.3.42
v2-inbound-request-2017.03 0 p STARTED 25105050 35.3gb 10.0.3.40 10.0.3.40

p indicates a primary shard
r indicates a replica shard
Both the number of shards and the number of replicas can be set when an index is created; the number of replicas can also be changed at any time after creation (see the sketch below).
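A minimal sketch of changing the replica count on an existing index; the index name is taken from the listing above, and index.number_of_replicas is the standard dynamic setting:
curl -XPUT '10.0.3.41:9200/v2-inbound-request-2017.03/_settings?pretty' -d '
{
"index" : {
"number_of_replicas" : 2
}
}'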
  1. Shard allocation rules
    elasticsearch distributes shards according to weighting factors configured at the cluster level, which you can set like this:
    curl -XPUT 'http://192.168.1.1:9200/_cluster/settings?pretty=true' -d '{
    "transient" : {
    "cluster.routing.allocation.balance.shard" : 0.33,
    "cluster.routing.allocation.balance.index" : 0.33,
    "cluster.routing.allocation.balance.primary" : 0.34,
    "cluster.routing.allocation.balance.threshold" : 1
    }
    }'

    The formula elasticsearch uses internally is:
    weight_index(node, index) = indexBalance * (node.numShards(index) - avgShardsPerNode(index))
    weight_node(node, index) = shardBalance * (node.numShards() - avgShardsPerNode)
    weight_primary(node, index) = primaryBalance * (node.numPrimaries() - avgPrimariesPerNode)
    weight(node, index) = weight_index(node, index) + weight_node(node, index) + weight_primary(node, index)
    If the resulting weight(node, index) exceeds the threshold, a shard relocation is triggered.
    Note: cluster.routing.allocation.balance.primary was deprecated after version 1.3.8.

    In an established cluster the shard distribution is even. But when you add nodes to scale out, you will notice that Elasticsearch tends to move replica shards to the new nodes first, so the new nodes end up holding almost nothing but replicas while the primary shards stay on the old nodes.

    It is hard to find a suitable ratio for the cluster.routing.allocation.balance parameters.

    A recommended approach when scaling out is to set cluster.routing.allocation.enable=primaries, which only allows primary shards to be moved.
    Once roughly half of the shards have been relocated, switch back to cluster.routing.allocation.enable=all so that the remaining relocations move replica shards.
    After scaling out, both the shards overall and the primary shards remain evenly distributed.
    curl -XPUT 'http://192.168.1.1:9200/_cluster/settings' -d '{
    "transient" : {
    "cluster.routing.allocation.enable” : "primaries"
    }
    }'

elasticsearch how to tune for search speed

How to Tune for search speed

  1. Give memory to the filesystem cache

    Elasticsearch heavily relies on the filesystem cache in order to make search fast.
    In general, you should make sure that at least half the available memory goes to the filesystem cache so that elasticsearch can keep hot regions of the index in physical memory.
  2. Use faster hardware

    If your search is I/O bound, you should investigate giving more memory to the filesystem cache (see above) or buying faster drives.
    If your search is CPU-bound, you should investigate buying faster CPUs.
  3. Document modeling

    Documents should be modeled so that search-time operations are as cheap as possible.
  4. Pre-index data

    You should leverage patterns in your queries to optimize the way data is indexed.
    For instance, if all your documents have a price field and most queries run range aggregations on a fixed list of ranges, you could make this aggregation faster by pre-indexing the ranges into the index and using a terms aggregation.
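    A sketch of that idea; the index, type, and field names (products, product, price_range) and the bucket labels are hypothetical, but the pattern of storing a pre-computed keyword field at index time and aggregating on it with terms is the one described above:
    curl -XPUT 'localhost:9200/products?pretty' -d'
    {
    "mappings": {
    "product": {
    "properties": {
    "price": { "type": "double" },
    "price_range": { "type": "keyword" }
    }
    }
    }
    }'
    curl -XPUT 'localhost:9200/products/product/1?pretty' -d'
    { "price": 26, "price_range": "10-50" }'
    curl -XGET 'localhost:9200/products/_search?pretty' -d'
    {
    "size": 0,
    "aggs": {
    "price_ranges": {
    "terms": { "field": "price_range" }
    }
    }
    }'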
  5. Mappings

    The fact that some data is numeric does not mean it should always be mapped as a numeric field.
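    For example, an identifier that happens to look numeric (a product id, an ISBN) is usually only queried with exact term-level lookups, and mapping it as keyword is typically faster for that access pattern than a numeric type. A hypothetical mapping sketch (index catalog, type item, field product_id):
    curl -XPUT 'localhost:9200/catalog?pretty' -d'
    {
    "mappings": {
    "item": {
    "properties": {
    "product_id": { "type": "keyword" }
    }
    }
    }
    }'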
  6. Avoid scripts

    In general, scripts should be avoided.
  7. Search rounded dates

    Queries on date fields that use now are typically not cacheable since the range that is being matched changes all the time. However switching to a rounded date is often acceptable in terms of user experience, and has the benefit of making better use of the query cache.
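    A sketch of the rounding idea, assuming a hypothetical date field named my_date: rounding now to the minute with /m keeps the query text identical for up to a minute, so the query cache can actually be reused:
    curl -XGET 'localhost:9200/index/_search?pretty' -d'
    {
    "query": {
    "bool": {
    "filter": {
    "range": {
    "my_date": {
    "gte": "now-1h/m",
    "lte": "now/m"
    }
    }
    }
    }
    }
    }'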
  8. Force-merge read-only indices

    Indices that are read-only would benefit from being merged down to a single segment. This is typically the case with time-based indices: only the index for the current time frame is getting new documents while older indices are read-only.
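    A minimal sketch, assuming a hypothetical read-only time-based index named logstash-2017.03.01; the _forcemerge API with max_num_segments=1 merges it down to a single segment:
    curl -XPOST 'localhost:9200/logstash-2017.03.01/_forcemerge?max_num_segments=1&pretty'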
  9. Warm up global ordinals

    Global ordinals are a data-structure that is used in order to run terms aggregations on keyword fields.
    You can tell elasticsearch to load global ordinals eagerly at refresh-time by configuring mappings as described below:
    curl -XPUT 'localhost:9200/index?pretty' -d'
    {
    "mappings": {
    "type": {
    "properties": {
    "foo": {
    "type": "keyword",
    "eager_global_ordinals": true
    }
    }
    }
    }
    }'
  10. Warm up the filesystem cache

    If the machine running elasticsearch is restarted, the filesystem cache will be empty, so it will take some time before the operating system loads hot regions of the index into memory so that search operations are fast. You can explicitly tell the operating system which files should be loaded into memory eagerly depending on the file extension using the index.store.preload setting
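    A hedged sketch of the setting; which file extensions are worth preloading depends on your workload (the choice below is purely illustrative), and the setting has to be supplied when the index is created:
    curl -XPUT 'localhost:9200/index?pretty' -d'
    {
    "settings": {
    "index.store.preload": ["nvd", "dvd"]
    }
    }'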

elasticsearch how to tune for indexing speed

How to tune for indexing speed

  1. Use bulk requests

    Bulk requests will yield much better performance than single-document index requests.
  2. Use multiple workers/threads to send data to elasticsearch

    This can be tested by progressively increasing the number of workers until either I/O or CPU is saturated on the cluster.
  3. Increase the refresh interval

    The default index.refresh_interval is 1s, which forces elasticsearch to create a new segment every second.
    Increasing this value (to say, 30s) will allow larger segments to flush and decreases future merge pressure.
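    A minimal sketch of raising the refresh interval on an existing index (the index name my_index is hypothetical); index.refresh_interval is a dynamic setting:
    curl -XPUT 'localhost:9200/my_index/_settings?pretty' -d'
    {
    "index" : {
    "refresh_interval" : "30s"
    }
    }'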
  4. Disable refresh and replicas for initial loads

    If you need to load a large amount of data at once, you should disable refresh by setting index.refresh_interval to -1 and set index.number_of_replicas to 0.
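    A sketch of that load pattern (the index name my_index is hypothetical): turn refresh and replication off before the bulk load, then restore the previous values once it has finished (the values restored below assume the defaults):
    curl -XPUT 'localhost:9200/my_index/_settings?pretty' -d'
    { "index" : { "refresh_interval" : "-1", "number_of_replicas" : 0 } }'

    # ... run the bulk load ...

    curl -XPUT 'localhost:9200/my_index/_settings?pretty' -d'
    { "index" : { "refresh_interval" : "1s", "number_of_replicas" : 1 } }'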
  5. Disable swapping

  6. Give memory to the filesystem cache

    The filesystem cache will be used in order to buffer I/O operations.
    You should make sure to give at least half the memory of the machine running elasticsearch to the filesystem cache.
  7. Use auto-generated ids

    When indexing a document that has an explicit id, elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows.
    By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.
  8. Use faster hardware

    If indexing is I/O bound, you should investigate giving more memory to the filesystem cache (see above) or buying faster drives.
  9. Indexing buffer size

    If your node is doing only heavy indexing, be sure indices.memory.index_buffer_size is large enough to give at most 512 MB indexing buffer per shard doing heavy indexing (beyond that indexing performance does not typically improve).
    The default is 10% which is often plenty: for example, if you give the JVM 10GB of memory, it will give 1GB to the index buffer, which is enough to host two shards that are heavily indexing.
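    The buffer size is a node-level setting configured in elasticsearch.yml rather than a dynamic index setting; a sketch with a purely illustrative value:
    # config/elasticsearch.yml
    indices.memory.index_buffer_size: 20%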

elasticsearch full cluster restart upgrade

Full cluster restart upgrade

  1. Disable shard allocation

    curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d'
    {
    "persistent": {
    "cluster.routing.allocation.enable": "none"
    }
    }'
  2. Perform a synced flush

    curl -XPOST 'localhost:9200/_flush/synced?pretty'
  3. Shutdown and upgrade all nodes

    Stop all Elasticsearch services on all nodes in the cluster
  4. Upgrade any plugins

    Elasticsearch plugins must be upgraded when upgrading a node.
    Use the elasticsearch-plugin script to install the correct version of any plugins that you need.
  5. Start the cluster

    If you have dedicated master nodes — nodes with node.master set to true (the default) and node.data set to false — then it is a good idea to start them first.
    curl -XGET 'localhost:9200/_cat/health?pretty'
    curl -XGET 'localhost:9200/_cat/nodes?pretty'
  6. Wait for yellow

    As soon as each node has joined the cluster, it will start to recover any primary shards that are stored locally.
  7. Reenable allocation

    curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d'
    {
    "persistent": {
    "cluster.routing.allocation.enable": "all"
    }
    }'

    curl -XGET 'localhost:9200/_cat/health?pretty'
    curl -XGET 'localhost:9200/_cat/recovery?pretty'

elasticsearch rolling upgrades

ES Rolling upgrades

  1. Disable shard allocation

    curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d'
    {
    "transient": {
    "cluster.routing.allocation.enable": "none"
    }
    }'
  2. Stop non-essential indexing and perform a synced flush (Optional)

    curl -XPOST 'localhost:9200/_flush/synced?pretty'
  3. Stop and upgrade a single node

    To upgrade using a zip or compressed tarball:
    Extract the zip or tarball to a new directory, to be sure that you don’t overwrite the config or data directories.
    Either copy the files in the config directory from your old installation to your new installation, or use the -E path.conf= option on the command line to point to an external config directory.
    Either copy the files in the data directory from your old installation to your new installation, or configure the location of the data directory in the config/elasticsearch.yml file, with the path.data setting.
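    A rough shell sketch of those steps; every path and the version number below are hypothetical placeholders:
    # extract the new version into its own directory
    tar -xzf elasticsearch-5.x.y.tar.gz -C /opt
    # reuse the old configuration
    cp /opt/elasticsearch-old/config/elasticsearch.yml /opt/elasticsearch-5.x.y/config/
    # or point at an external config directory when starting the node:
    # bin/elasticsearch -E path.conf=/etc/elasticsearch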
  4. Upgrade any plugins

    Elasticsearch plugins must be upgraded when upgrading a node.
    Use the elasticsearch-plugin script to install the correct version of any plugins that you need.
  5. Start the upgraded node

    Start the now upgraded node and confirm that it joins the cluster by checking the log file or by checking the output of this request:
    curl -XGET 'localhost:9200/_cat/nodes?pretty'
  6. Reenable shard allocation

    curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d'
    {
    "transient": {
    "cluster.routing.allocation.enable": "all"
    }
    }'
  7. Wait for the node to recover

    You should wait for the cluster to finish shard allocation before upgrading the next node.
    curl -XGET 'localhost:9200/_cat/health?pretty'
    Wait for the status column to move from yellow to green.

    Shards that have not been sync-flushed may take some time to recover.
    The recovery status of individual shards can be monitored with the _cat/recovery request:
    curl -XGET 'localhost:9200/_cat/recovery?pretty'
    If you stopped indexing, then it is safe to resume indexing as soon as recovery has completed.
  8. Repeat

    When the cluster is stable and the node has recovered, repeat the above steps for all remaining nodes.

elasticsearch core component

elasticsearch core component

1.NRT(Near realtime)
Elasticsearch is a near real time search platform.

2.cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes.
A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

3.node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities.
Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup.
This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

4.index
An index is a collection of documents that have somewhat similar characteristics.
An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
In a single cluster, you can define as many indexes as you want.

5.type
Within an index, you can define one or more types.
A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields.
For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.

6.document
A document is a basic unit of information that can be indexed.
For example, you can have a document for a single customer, another document for a single product, and yet another for a single order.
This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.

Within an index/type, you can store as many documents as you want.

Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.

7.shards & replicas
An index can potentially store a large amount of data that can exceed the hardware limits of a single node.
For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards.
When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.

Sharding is important for two primary reasons:
1.It allows you to horizontally split/scale your content volume
2.It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason.
To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.

Replication is important for two primary reasons:
1.It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
2.It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.

To summarize, each index can be split into multiple shards.
An index can also be replicated zero (meaning no replicas) or more times.
Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).
The number of shards and replicas can be defined per index at the time the index is created.
After the index is created, you may change the number of replicas dynamically anytime, but you cannot change the number of shards after the fact.
In other words, once an index has been created you can add or remove replicas dynamically, but the primary shards themselves cannot be changed.

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
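A sketch of setting both values explicitly when the index is created and then changing only the replica count afterwards (the index name blog is hypothetical):
curl -XPUT 'localhost:9200/blog?pretty' -d'
{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 1
}
}'
curl -XPUT 'localhost:9200/blog/_settings?pretty' -d'
{ "index" : { "number_of_replicas" : 2 } }'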

Note:
Each Elasticsearch shard is a Lucene index.
There is a maximum number of documents you can have in a single Lucene index.
As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the "_cat/shards" api.