Emulate a SQL LIKE search with ElasticSearch


I'm just beginning with Elasticsearch and am trying to implement an autocomplete feature based on it.

I have an autocomplete index with a field city of type string. Here's an example of a document stored in this index:

{
  "_index": "autocomplete_1435797593949",
  "_type": "listing",
  "_id": "40716",
  "_source": {
    "city": "rome",
    "tags": [
      "listings"
    ]
  }
}

The analysis configuration looks like this:

{
  "analyzer": {
    "autocomplete_term": {
      "tokenizer": "autocomplete_edge",
      "filter": [
        "lowercase"
      ]
    },
    "autocomplete_search": {
      "tokenizer": "keyword",
      "filter": [
        "lowercase"
      ]
    }
  },
  "tokenizer": {
    "autocomplete_edge": {
      "type": "ngram",
      "min_gram": 1,
      "max_gram": 100
    }
  }
}

The mappings:

{
  "autocomplete_1435795884170": {
    "mappings": {
      "listing": {
        "properties": {
          "city": {
            "type": "string",
            "analyzer": "autocomplete_term"
          }
        }
      }
    }
  }
}

I'm sending the following query to ES:

{
  "query": {
    "multi_match": {
      "query": "rio",
      "analyzer": "autocomplete_search",
      "fields": [
        "city"
      ]
    }
  }
}

As a result, I get the following:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.7742395,
    "hits": [
      {
        "_index": "autocomplete_1435795884170",
        "_type": "listing",
        "_id": "53581",
        "_score": 2.7742395,
        "_source": {
          "city": "rio",
          "tags": [
            "listings"
          ]
        }
      }
    ]
  }
}

For the most part, it works. It finds the document with city = "rio" before the user has typed the whole word ("ri" is enough).

And here lies my problem. I want it to return "rio de janeiro", too. To get "rio de janeiro", I need to send the following query:

{
  "query": {
    "multi_match": {
      "query": "rio d",
      "analyzer": "standard",
      "fields": [
        "city"
      ]
    }
  }
}

Notice the "<whitespace>d" there.

Another related problem is that I'd expect at least all cities that start with "r" to be returned by the following query:

{
  "query": {
    "multi_match": {
      "query": "r",
      "analyzer": "standard",
      "fields": [
        "city"
      ]
    }
  }
}

I'd expect "rome", etc. (a document that exists in the index); however, I only get "rio" again. I would like it to behave like the SQL LIKE condition, i.e. ... LIKE 'cityname%'.

What am I doing wrong?

I would do it like this:

  • Change the tokenizer to edge_ngram, since you said you need LIKE 'cityname%' (meaning a prefix match):
    "tokenizer": {
      "autocomplete_edge": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 100
      }
    }
  • Have the field specify your autocomplete_search as its search_analyzer. I think it's a good choice to have keyword and lowercase:
    "mappings": {
      "listing": {
        "properties": {
          "city": {
            "type": "string",
            "index_analyzer": "autocomplete_term",
            "search_analyzer": "autocomplete_search"
          }
        }
      }
    }
  • And the query itself is as simple as:
    {
      "query": {
        "multi_match": {
          "query": "r",
          "fields": [
            "city"
          ]
        }
      }
    }
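Taken together, the three steps above amount to a single index-creation body. Here is a minimal sketch in Python that only assembles that body as a dictionary and prints it as JSON. It assumes the Elasticsearch 1.x syntax used in this thread (`string` type, `index_analyzer`) and the usual `settings.analysis` wrapper; the variable name `index_body` is just illustrative, and nothing is sent to a cluster.

```python
import json

# Assumption: Elasticsearch 1.x syntax, as in the snippets of this thread.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Index-time analyzer: edge n-grams, lowercased.
                "autocomplete_term": {
                    "tokenizer": "autocomplete_edge",
                    "filter": ["lowercase"],
                },
                # Search-time analyzer: whole input as one token, lowercased.
                "autocomplete_search": {
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
            "tokenizer": {
                "autocomplete_edge": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 100,
                }
            },
        }
    },
    "mappings": {
        "listing": {
            "properties": {
                "city": {
                    "type": "string",
                    "index_analyzer": "autocomplete_term",
                    "search_analyzer": "autocomplete_search",
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The printed JSON could then be used as the body when creating the index, for example with curl or a client library of your choice.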

The detailed explanation goes like this: split your city names into edge ngrams. For example, for rio de janeiro you'll index something like:

"city": [
  "r",
  "ri",
  "rio",
  "rio ",
  "rio d",
  "rio de",
  "rio de ",
  "rio de j",
  "rio de ja",
  "rio de jan",
  "rio de jane",
  "rio de janei",
  "rio de janeir",
  "rio de janeiro"
]

You'll notice that everything is already lowercased. Now, you'd want your query to take any text (lowercase or not) and match it against what's in the index. So, an r should match the list above.
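To make the difference between the two tokenizers concrete, here is a rough sketch in plain Python. It only approximates what the `edge_ngram` and `ngram` tokenizers emit (it is not Lucene's implementation, and the token order may differ); the function names `edge_ngrams` and `ngrams` are made up for this illustration.

```python
def edge_ngrams(text, min_gram=1, max_gram=100):
    """Return the prefixes of `text` whose length lies in [min_gram, max_gram]."""
    text = text.lower()  # mirrors the "lowercase" token filter
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngrams(text, min_gram=1, max_gram=100):
    """Return every substring of `text` within the length bounds, for comparison."""
    text = text.lower()
    grams = []
    for start in range(len(text)):
        longest = min(len(text) - start, max_gram)
        for n in range(min_gram, longest + 1):
            grams.append(text[start:start + n])
    return grams


print(edge_ngrams("Rio"))  # ['r', 'ri', 'rio']
print(ngrams("Rio"))       # ['r', 'ri', 'rio', 'i', 'io', 'o']
print(len(edge_ngrams("rio de janeiro")))  # 14
```

Only the edge variant restricts the indexed terms to prefixes, which is what a LIKE 'cityname%' style match needs; the plain n-gram variant also indexes inner substrings such as "io".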

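For this to happen, you want the input text to be lowercased and otherwise kept exactly as the user typed it, meaning it shouldn't be split into tokens. Why would you want this? Because you have already split the city names into ngrams and you don't want the same done to the input text. If the user inputs "RI", Elasticsearch will lowercase it (ri) and match it exactly against what it has in the index.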
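The matching logic described here can be simulated in a few lines of Python. This is only a sketch of the idea, not how Elasticsearch works internally: index time produces lowercased prefixes, search time produces a single lowercased term, and a document matches when that term is one of its indexed terms. The helper names and the sample city list (including "paris") are invented for the example.

```python
def index_terms(city, min_gram=1, max_gram=100):
    """Index-time analysis: lowercase, then emit every prefix (edge n-gram)."""
    city = city.lower()
    return {city[:n] for n in range(min_gram, min(len(city), max_gram) + 1)}


def search_term(user_input):
    """Search-time analysis: keep the input as one token, only lowercase it."""
    return user_input.lower()


def search(cities, user_input):
    """Return the cities whose indexed terms contain the single search term."""
    term = search_term(user_input)
    return [city for city in cities if term in index_terms(city)]


# Hypothetical sample data for the illustration.
cities = ["rome", "rio", "rio de janeiro", "paris"]

print(search(cities, "R"))      # ['rome', 'rio', 'rio de janeiro']
print(search(cities, "RI"))     # ['rio', 'rio de janeiro']
print(search(cities, "rio d"))  # ['rio de janeiro']
```

The results follow the input order here; a real search would order them by score.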

A faster alternative to multi_match is to use term, but this requires your application/website to lowercase the text. The reason is that term doesn't analyze the input text at all.

{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "city": {
            "value": "ri"
          }
        }
      }
    }
  }
}
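Since term leaves the lowercasing to you, the application has to do it before building the request. A small sketch of that step in Python follows; the function name `build_term_query` and its `field` parameter are hypothetical, and the `filtered` query shape assumes the same Elasticsearch 1.x syntax as above.

```python
import json


def build_term_query(user_input, field="city"):
    """Build a term filter body for the given input.

    The text is lowercased here because `term` is not analyzed,
    so it must already match the lowercased terms in the index.
    """
    return {
        "query": {
            "filtered": {
                "filter": {
                    "term": {field: {"value": user_input.lower()}}
                }
            }
        }
    }


# Whatever the user typed ("RI", "Ri", "ri") ends up as "ri" in the request body.
print(json.dumps(build_term_query("RI"), indent=2))
```

Whitespace is deliberately left untouched, because terms such as "rio d" contain a space that has to match the indexed edge ngram exactly.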
