Implement a simple search engine based on MySQL

Implementing a search engine based on MySQL

1. ngram full-text parser
2. Create a full-text index

1. Create a full-text index when building a table
2. Through alter table method
3. Through create index method

3. Search method

1. Natural Language Search (NATURAL LANGUAGE MODE)

4. Comparison with Like

Implementing a search engine based on MySQL

Preface:

In fact, MySQL has supported full-text indexing for a long time, but it has only supported English searches. Starting from version 5.7.6, MySQL has a built-in ngram full-text parser to support Chinese, Japanese, and Korean word segmentation.

Mysql full-text index adopts the principle of inverted index. In the inverted index, the keyword is the primary key, and each keyword corresponds to a series of files in which the keyword appears. In this way, when a user searches for a keyword, the sorting program locates the keyword in the inverted index and can immediately find all the files containing the keyword.

This article is tested based on MySQL 8.0, and the database engine used is InnoDB

1. ngram full-text parser

An ngram is a sequence of n consecutive words in a text. The ngram full-text parser is able to tokenize text, where each word is a sequence of n consecutive words. For example, use the ngram full-text parser to segment "你好靓仔":

n=1: 'you', 'good', 'pretty', 'boy' 
n=2: 'Hello', 'Very pretty', 'Handsome boy' 
n=3: 'You are so pretty', 'You are so pretty' 
n=4: 'Hello handsome boy'

In MySQL, the global variable ngram_token_size is used to configure the size of n in ngram. Its value range is 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum number of words to be queried. If you need to search for single words, set ngram_token_size to 1. With the default value of 2, searching for a single word will yield no results. Because a Chinese word consists of at least two characters, the default value of 2 is recommended.

You can view the default ngram_token_size of MySQL by running the following command:

show variables like 'ngram_token_size'

There are two ways to set the value of the global variable ngram_token_size :

(1) When starting the mysqld command, specify:

mysqld --ngram_token_size=2

(2) Modify the MySQL configuration file my.ini and add a line of parameters at the end:

ngram_token_size=2

2. Create a full-text index

1. Create a full-text index when building a table

CREATE TABLE `article` (
  `id` bigint NOT NULL,
  `url` varchar(1024) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
  `title` varchar(256) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
  `source` varchar(32) COLLATE utf8mb4_general_ci DEFAULT '',
  `keywords` varchar(32) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `publish_time` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `title_index` (`title`) WITH PARSER `ngram`
) 

ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

2. Through alter table method

ALTER TABLE article ADD FULLTEXT INDEX title_index(title) WITH PARSER ngram;

3. Through create index method

CREATE FULLTEXT INDEX title_index ON article (title) WITH PARSER ngram;

3. Search method

1. Natural Language Search (NATURAL LANGUAGE MODE)

Natural language mode is the default full-text search mode of MySQL. The natural language mode cannot use operators and cannot specify complex queries such as keywords must appear or must not appear.

Example:

select * from article where MATCH(title) AGAINST ('Beijing Tourism' IN NATURAL LANGUAGE MODE);

// If no mode is specified, the natural language mode is used by default select * from article where MATCH(title) AGAINST ('北京旅游');

It can be seen that in this mode, searching for "Beijing Travel" can search for content containing "Beijing" or "Travel", because it is divided into two keywords based on natural language.

In the above example, the results returned are automatically sorted by matching degree, with the highest matching degree at the front. The matching degree is a non-negative floating point number.

Example:

// Check the matching degree select * , MATCH(title) AGAINST ('Beijing Tourism') as score from article where MATCH(title) AGAINST ('Beijing Tourism' IN NATURAL LANGUAGE MODE);

2. Boolean search (BOOLEAN MODE)

Boolean search mode can use operators to support complex queries such as specifying that a keyword must appear or must not appear or whether the keyword weight is high or low.

Example:

// No operator // Contains "Dating" or "Strategy"
select * from article where MATCH(title) AGAINST ('Dating Guide' IN BOOLEAN MODE);

// Use operator // Must contain "Dating", can contain "Strategy"
select * from article where MATCH(title) AGAINST ('+Dating Guide' IN BOOLEAN MODE);

More operator examples:

'Dating Tips' 
No operator, means OR, either contains "Dating" or "Strategy"

'+Dating+Strategies'
Must contain both words '+Dating Tips'
It must include "Dating", but the match is higher if it also includes "Strategy".

'+Dating-Strategy'
It must contain "Date" and cannot contain "Strategy".

'+Dating~Strategy'
"Dating" must be included, but if "Strategy" is also included, the match score is lower than the record without "Strategy".

'+Dating+(>Strategies<Tips)'
The query must contain records for "dating" and "strategies" or "dating" and "skills", but "dating strategies" has a higher match than "dating skills".

'Dating*'
The query includes records that begin with "Appointment".

'"Dating Tips"'
Use double quotes to enclose the words to be searched, the effect is similar to like '%Dating Guide%',
For example, "Dating strategies for beginners" will be matched, but "Dating strategies" will not be matched.

4. Comparison with Like

Compared with like query, full-text index has the following advantages:

LIKE only performs fuzzy matching, but full-text indexing provides some grammatical and semantic query functions, and will perform word segmentation operations on the string to be searched, which is determined by the MySQL vocabulary.
The full-text index can set the minimum and maximum length of words, as well as the words to be ignored. These can all be set.
Using full-text index to search a string in a column will return the matching degree, which can be understood as the number of matching keywords, which is a floating point number.

And the performance of full-text search is better than that of like query

The following is a test based on about 50w data:

// like query select * from article where title like '%北京%';

// Full-text index query select * from article where MATCH(title) AGAINST ('北京' IN BOOLEAN MODE);

It can be seen that the like query is 1.536s, and the full-text index query is 0.094s, which is about 16 times faster.

Full-text indexing enables fast searches, but there is also the overhead of maintaining the index. The larger the field length, the larger the full-text index created, which affects the throughput of DML statements. If the amount of data is not large, you can use full-text indexing for searching, which is simple and convenient. However, if the amount of data is large, it is recommended to use a dedicated search engine ElasticSearch to do this.

The above is the details of implementing a simple search engine based on MySQL. For more information about implementing search engines with MySQL, please pay attention to other related articles on 123WORDPRESS.COM! I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:

Sharing of MySQL character set and database engine modification methods
Practical tutorial on modifying MySQL character set
MySQL character set viewing and modification tutorial
MySQL startup error 1067 and invalid recovery after changing character set and restarting
How to change the character set encoding to UTF8 in MySQL 5.5/5.6 under Linux
Comparison of storage engines supported by MySQL database
Differences and comparisons of storage engines in MySQL
MySQL learning summary: a preliminary understanding of the architectural design of the InnoDB storage engine
MySQL changes the default engine and character set details

<<: Implementing custom radio and check box functions with pure CSS

>>: How to customize more beautiful link prompt effect with CSS