Example code for implementing a simple search engine with MySQL

Preface
Introduction
ngram full-text parser
Create a full-text index
Search method

1. Natural Language Search (NATURAL LANGUAGE MODE)
2. Boolean search (BOOLEAN MODE)

Compare with Like

Summarize

Preface

Some time ago, due to project requirements, we needed to search chat records based on keywords. Isn’t this the function of a search engine?

So the first thing that came to my mind was the ElasticSearch distributed search engine, but for some reasons, the company's server resources were relatively tight, and there were no extra machines to deploy an ElasticSearch service. In addition, the online time was relatively tight, and the amount of data was not large. Then I thought of MySQL's full-text index.

Introduction

In fact, MySQL has supported full-text indexing for a long time, but it has only supported English searches. Starting from version 5.7.6, MySQL has a built-in ngram full-text parser to support Chinese, Japanese, and Korean word segmentation.

Mysql full-text index adopts the principle of inverted index. In the inverted index, the keyword is the primary key, and each keyword corresponds to a series of files in which the keyword appears. In this way, when a user searches for a keyword, the sorting program locates the keyword in the inverted index and can immediately find all the files containing the keyword.

This article is tested based on MySQL 8.0, and the database engine used is InnoDB

ngram full-text parser

An ngram is a sequence of n consecutive words in a text. The ngram full-text parser is able to tokenize text, where each word is a sequence of n consecutive words. For example, use the ngram full-text parser to segment "你好靓仔":

n=1: 'you', 'good', 'pretty', 'boy' 
n=2: 'Hello', 'Very pretty', 'Handsome boy' 
n=3: 'You are so pretty', 'You are so pretty' 
n=4: 'Hello handsome boy'

In MySQL, the global variable ngram_token_size is used to configure the size of n in ngram. Its value range is 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum number of words to be queried. If you need to search for single words, set ngram_token_size to 1. With the default value of 2, searching for a single word will yield no results. Because a Chinese word consists of at least two characters, the default value of 2 is recommended.

You can view the default ngram_token_size of MySQL by running the following command:

show variables like 'ngram_token_size'

There are two ways to set the value of the global variable ngram_token_size :

1. Specify when starting the mysqld command:

mysqld --ngram_token_size=2

2. Modify the Mysql configuration file my.ini and add a line of parameters at the end:

ngram_token_size=2

Create a full-text index

1. Create a full-text index when building a table

CREATE TABLE `article` (
  `id` bigint NOT NULL,
  `url` varchar(1024) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
  `title` varchar(256) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '',
  `source` varchar(32) COLLATE utf8mb4_general_ci DEFAULT '',
  `keywords` varchar(32) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `publish_time` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `title_index` (`title`) WITH PARSER `ngram`
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

2. Through alter table method

ALTER TABLE article ADD FULLTEXT INDEX title_index(title) WITH PARSER ngram;

3. Through create index method

CREATE FULLTEXT INDEX title_index ON article (title) WITH PARSER ngram;

Search method

1. Natural Language Search (NATURAL LANGUAGE MODE)

Natural language mode is the default full-text search mode of MySQL. The natural language mode cannot use operators and cannot specify complex queries such as keywords must appear or must not appear.

Example

select * from article where MATCH(title) AGAINST ('Beijing Tourism' IN NATURAL LANGUAGE MODE);

// If no mode is specified, the natural language mode is used by default select * from article where MATCH(title) AGAINST ('北京旅游');

It can be seen that in this mode, searching for "Beijing Travel" can search for content containing "Beijing" or "Travel", because it is divided into two keywords based on natural language.

In the above example, the results returned are automatically sorted by matching degree, with the highest matching degree at the front. The matching degree is a non-negative floating point number.

Example

// Check the matching degree select * , MATCH(title) AGAINST ('Beijing Tourism') as score from article where MATCH(title) AGAINST ('Beijing Tourism' IN NATURAL LANGUAGE MODE);

2. Boolean search (BOOLEAN MODE)

Boolean search mode can use operators to support complex queries such as specifying that a keyword must appear or must not appear or whether the keyword weight is high or low.

Example

// No operator // Contains "Dating" or "Strategy"
select * from article where MATCH(title) AGAINST ('Dating Guide' IN BOOLEAN MODE);

// Use operator // Must contain "Dating", can contain "Strategy"
select * from article where MATCH(title) AGAINST ('+Dating Guide' IN BOOLEAN MODE);

More operator examples:

'Dating Tips' 
No operator, means OR, either contains "Dating" or "Strategy"

'+Dating+Strategies'
Must contain both words '+Dating Tips'
It must include "Dating", but the match is higher if it also includes "Strategy".

'+Dating-Strategy'
It must contain "Date" and cannot contain "Strategy".

'+Dating~Strategy'
"Dating" must be included, but if "Strategy" is also included, the match score is lower than the record without "Strategy".

'+Dating+(>Strategies<Tips)'
The query must contain records for "dating" and "strategies" or "dating" and "skills", but "dating strategies" has a higher match than "dating skills".

'Dating*'
The query includes records that begin with "Appointment".

'"Dating Tips"'
Use double quotes to enclose the words to be searched, the effect is similar to like '%Dating Guide%',
For example, "Dating strategies for beginners" will be matched, but "Dating strategies" will not be matched.

Compare with Like

Compared with like query, full-text index has the following advantages:

LIKE only performs fuzzy matching, but full-text indexing provides some grammatical and semantic query functions, and will perform word segmentation operations on the string to be searched, which is determined by the MySQL vocabulary.
The full-text index can set the minimum and maximum length of words, as well as the words to be ignored. These can all be set.
Using full-text index to search a string in a column will return the matching degree, which can be understood as the number of matching keywords, which is a floating point number.

And the performance of full-text search is better than that of like query

The following is a test based on about 50w data:

// like query select * from article where title like '%北京%';

// Full-text index query select * from article where MATCH(title) AGAINST ('北京' IN BOOLEAN MODE);

It can be seen that the like query is 1.536s, and the full-text index query is 0.094s, which is about 16 times faster.

Summarize

Full-text indexing enables fast searches, but there is also the overhead of maintaining the index. The larger the field length, the larger the full-text index created, which affects the throughput of DML statements. If the amount of data is not large, you can use full-text indexing for searching, which is simple and convenient. However, if the amount of data is large, it is recommended to use a dedicated search engine ElasticSearch to do this.

This is the end of this article about the sample code for implementing a simple search engine in MySQL. For more relevant MySQL search engine content, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:

MySQL full-text index to achieve a simple version of the search engine example code
Detailed introduction to MySQL-Fulltext, a search engine based on MySQL
Scrapy+flask+html to create a sample code for the search engine
Python implements article duplication check function based on search engine
Python practical hand writing a search engine
Detailed explanation of Python large-scale search engine image crawler tool
360 search engine automatic inclusion PHP rewriting solution
PHP implementation code for recording search engine crawling records
Python lossless music search engine implementation code
Implement a simple search engine based on MySQL

<<: Share 5 JS high-order functions

>>: DIV common tasks (Part 1) — General tasks (show scrollbars/hide divs/disable event bubbling, etc.)