Mysql implementation of full-text search and keyword scoring method example

1. Introduction

Today a colleague asked me how to use MySQL to implement a full-text search function similar to ElasticSearch and score the search keywords? I immediately had a question in my mind? Why not just use es? Simple, easy to use and fast. But he said that the amount of data is not large, the time given by the client is very limited, and there is no time to build es, so let's take a look at the full-text search function of MySQL!

Starting from version 5.7.6, MySQL has a built-in ngram full-text parser to support Chinese, Japanese, and Korean word segmentation. Prior to MySQL 5.7.6, full-text indexing only supported English full-text indexing, not Chinese full-text indexing. It was necessary to use a word segmenter to pre-process the Chinese paragraphs into words and then store them in the database. This article was tested using Mysql 5.7.6, InnoDB database engine. mysql full text search

2. Full-text parser ngram

An ngram is a sequence of n consecutive words in a text. The ngram full-text parser can segment text into words, where each word is a sequence of n consecutive words.
For example, use the ngram full-text parser to tokenize "hello world":

n=1: 'you', 'good', 'world', 'world' 
n=2: 'Hello', 'Good World', 'World' 
n=3: 'Hello world', 'Good world' 
n=4: 'Hello World'

MySQL uses the global variable ngram_token_size to configure the size of n in ngram. Its value range is 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum number of words to be queried. If you need to search for single words, set ngram_token_size to 1. With the default value of 2, searching for a single word will yield no results. Because a Chinese word consists of at least two characters, the default value of 2 is recommended.

Let's take a look at the default ngram_token_size of MySQL:

show variables like 'ngram_token_size'

There are two ways to set the ngram_token_size variable:

1. Specify when starting the mysqld command

mysqld --ngram_token_size=2

2. Modify the MySQL configuration file

[mysqld] 
ngram_token_size=2

3. Full-text index

Taking a document data as an example, a new data table t_wenshu is created, and a full-text index is created for the document content field, and 100,000 test data are imported.

1. Create a full-text index when building a table

CREATE TABLE `t_wenshu` (
 `province` varchar(255) DEFAULT NULL,
 `caseclass` varchar(255) DEFAULT NULL,
 `casenumber` varchar(255) DEFAULT NULL,
 `caseid` varchar(255) DEFAULT NULL,
 `types` varchar(255) DEFAULT NULL,
 `title` varchar(255) DEFAULT NULL,
 `content` longtext,
 `updatetime` varchar(255) DEFAULT NULL,
 FULLTEXT KEY `content` (`content`) WITH PARSER `ngram`
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

2. Through alter table method

ALTER TABLE t_wenshu ADD FULLTEXT INDEX content_index (content) WITH PARSER ngram;

3. Through create index method

CREATE FULLTEXT INDEX content_index ON t_wenshu (content) WITH PARSER ngram;

4. Search Mode

Natural Language Retrieval

(IN NATURAL LANGUAGE MODE) Natural language mode is the default full-text search mode of MySQL. The natural language mode cannot use operators and cannot specify complex queries such as keywords must appear or must not appear.

Boolean search

(IN BOOLEAN MODE) Eliminate words that appear in more than half of the matching rows. For example, if every row contains the word "this", no results will be found when searching with "this". This is very useful when there are a large number of records, because the database believes that it is meaningless to find all the rows. In this case, "this" is almost regarded as a stopword. Operators can be used in Boolean search mode to support complex queries such as specifying that a keyword must appear or must not appear or whether the keyword weight is high or low.

● Features of IN BOOLEAN MODE:
Do not remove rows that have more than 50% matching criteria.
Does not automatically sort in reverse order of relevance.
It is possible to search fields without a FULLTEXT index, but it will be very slow.
·Limit the maximum and minimum character strings.
· Apply Stopwords.
● Search syntax rules:
+ Must be present (data entries not containing this keyword will be ignored).
- Not allowed (excluding the specified keyword, all entries containing the keyword will be ignored).
> Increase the weight of this matching data.
< Reduce the weight of the matching data.
~ turns its correlation from positive to negative, indicating that having the word will reduce the correlation (but not like - which excludes it), it just ranks lower and has a lower weight.
* Wildcard, unlike other syntaxes where it is placed at the front, this one should be placed after the string.
" " Use double quotation marks to enclose a sentence to indicate that it must match the original sentence completely and characters cannot be separated.

Query expansion search

Note: (WITH QUERY EXPANSION) Since query expansion may bring many irrelevant queries, use it with caution!

5. Search Query

1) Query the records containing "theft" in the content. The query statement is as follows

select caseid,content, MATCH ( content) AGAINST ('Theft') as score from t_wenshu where MATCH ( content) AGAINST ('Theft' IN NATURAL LANGUAGE MODE)

2) Search for records containing "挑交惹事" in the content. The query statement is as follows

select caseid,content, MATCH ( content) AGAINST ('寻怨惹事') as score from t_wenshu where MATCH ( content) AGAINST ('寻怨惹事' IN NATURAL LANGUAGE MODE) ;

3) Single Chinese character, search for records containing "我" in the content, the query statement is as follows

select caseid,content, MATCH ( content) AGAINST ('我') as score from t_wenshu where MATCH ( content) AGAINST ('我' IN NATURAL LANGUAGE MODE) ;

Note: Because the value of the global variable ngram_token_size is set to 2. If you want to query a single Chinese character, you need to modify ngram_token_size = 1 in the configuration file my.ini and restart the mysqld service. I will not try this here.

4) The query field content contains "dangerous driving" and "picking quarrels and provoking trouble" as follows:

select caseid,content, MATCH (content) AGAINST ('+dangerous driving+picking quarrels and provoking troubles') as score from t_wenshu where MATCH (content) AGAINST ('+dangerous driving+picking quarrels and provoking troubles' IN BOOLEAN MODE);

5) The query field content contains "dangerous driving" but does not contain "picking quarrels and provoking trouble" as follows:

select caseid,content, MATCH (content) AGAINST ('+dangerous driving-picking quarrels and provoking troubles') as score from t_wenshu where MATCH (content) AGAINST ('+dangerous driving-picking quarrels and provoking troubles' IN BOOLEAN MODE);

6) The query field conent contains "dangerous driving" or "picking quarrels and provoking trouble" as follows:

select caseid,content, MATCH (content) AGAINST ('Dangerous driving and provoking trouble') as score from t_wenshu where MATCH (content) AGAINST ('Dangerous driving and provoking trouble' IN BOOLEAN MODE);

VI. Conclusion

1) Before using MySQL full-text indexing, find out the support status of each version;

2) Full-text indexing is N times faster than like + %, but there may be accuracy issues;

3) If a large amount of data needs to be fully indexed, it is recommended to add the data first and then create the index;

4) For Chinese, you can use MySQL 5.7.6 or later versions, or third-party plug-ins such as Sphinx and Lucene;

5) The field name used by the MATCH() function must be consistent with the field name specified when creating the full-text index, and can only be fields of the same table and cannot cross tables;

This is the end of this article about MySQL full-text search and keyword scoring. For more relevant MySQL full-text search and keyword scoring content, please search 123WORDPRESS.COM's previous articles or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in: