MySQL full-text index to achieve a simple version of the search engine example code

MySQL full-text index to achieve a simple version of the search engine example code

Preface

  • Only Innodb and MyISAM storage engines can use full-text indexes (Innodb supports full-text indexes starting from MySQL 5.6)
  • char, varchar, and text type fields can create full-text indexes (fulltext index type)
  • Full-text indexing is based on keywords. To distinguish different keywords, we need to use stopwords.
  • English words are segmented by spaces and commas; Chinese word segmentation is inconvenient (I don’t know how to distinguish different keywords in a sentence)
  • Built-in word segmentation parser ngram supports Chinese, Japanese, Korean (break sentences into a fixed number of phrases)
  • When writing a large amount of data to a table, it is faster to create a full-text index after writing the data (reducing the overhead of maintaining the index)
  • The inverted index (a data structure) of the full-text indexing principle generally uses an associative array to store the mapping between words and their positions in the document in an auxiliary table.

use

Search with MATCH() ... AGAINST

match() indicates that column is being searched, against indicates that string is being searched

View the default word segmentation (use these words to distinguish different keywords); you can also customize the word segmentation to distinguish different keywords

SELECT * FROM information_schema.INNODB_FT_DEFAULT_STOPWORD;

like

+-------+ 
| value | 
+-------+ 
| a | 
| about | 
| an | 
| are | 
| as | 
| at | 
| be | 
| by | 
| com | 
| de | 
| en | 
| for | 
| from |

Three types of full-text search methods

natural language search

Pass a specific string to check through MATCH AGAINST. The default method

boolean search

Add operators to the search string, such as "+" means must contain, "-" means not contain, "*" means wildcard, even if the passed string is small or appears in the stop word, it will not be filtered out

query expansion search

The search string is used to perform a natural language search, then the words from the most relevant rows returned by the search are added to the search string and the search is done again, the query will return the rows from the second search

Related parameters

Configure related parameters

innodb_ft_min_token_size

The default value is 3, which means that at least 3 characters can be used as a keyword. Increasing this value can reduce the size of the full-text index.

innodb_ft_max_token_size

The default value is 84, which means that a keyword can contain up to 84 characters. Limiting this value can reduce the size of the full-text index.

ngram_token_size

The default value is 2, which means that two characters are used as a keyword for the built-in word segmentation parser. For example, to create a full-text index for "abcd", the keywords are 'ab', 'bc', 'cd'
innodb_ft_min_token_size and innodb_ft_max_token_size have no effect when using the ngram parser

Note that these three parameters cannot be modified dynamically. If you modify these parameters, you need to restart the MySQL service and re-establish the full-text index.

Test the innodb engine using full-text index

Prepare

1. Goal

  • Check whether an article contains a certain keyword; the number of times a certain keyword appears in a series of articles
  • Check whether the title of the article contains a certain keyword

2. Set the following parameters to reduce disk IO pressure

SET GLOBAL sync_binlog=100;
SET GLOBAL innodb_flush_log_at_trx_commit=2;

3. Import 1kw data to test full-text indexing

The data comes from online search

Extraction code: iyip

4. The structure of an article table

CREATE TABLE `article` (
 `id` bigint(10) NOT NULL,
 `url` varchar(1024) CHARACTER SET latin1 NOT NULL DEFAULT '',
 `title` varchar(256) NOT NULL DEFAULT '',
 `source` varchar(32) DEFAULT '' COMMENT 'True source',
 `keywords` varchar(32) DEFAULT NULL,
 `publish_time` timestamp NULL DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `title_idx` (`title`)
) ENGINE=InnoDB

Use myloader to import test data in multiple threads

First decompress the test data tar -zxf mydumper_dump_article.tar.gz
time myloader -u $user -p $passwd -S $socket -t 32 -d /datas/dump_article -v 3

5. Total data volume and data file and index file size after importing data

SELECT COUNT(*) FROM `article`;
+----------+
| COUNT(*) |
+----------+
| 10000000 |
+----------+
1 row in set (7.85 sec)

SELECT table_name, CONCAT(FORMAT(SUM(data_length) / 1024 / 1024,2),'M') AS dbdata_size, CONCAT(FORMAT(SUM(index_length) / 1024 / 1024,2),'M') AS dbindex_size, CONCAT(FORMAT(SUM(data_length + index_length) / 1024 / 1024 / 1024,2),'G') AS `db_size(G)`, AVG_ROW_LENGTH,table_rows,update_time FROM information_schema.tables WHERE table_schema = DATABASE() and table_name='article';
+------------+-------------+--------------+------------+----------------+------------+---------------------+
| table_name | dbdata_size | dbindex_size | db_size(G) | AVG_ROW_LENGTH | table_rows | update_time |
+------------+-------------+--------------+------------+----------------+------------+---------------------+
| article | 3,710.00M | 1,003.00M | 4.60G | 414 | 9388739 | 2019-07-05 15:31:37 |
+------------+-------------+--------------+------------+----------------+------------+---------------------+

Create a full-text index using the default method

1. The table already has a keyword field (a brief description of the article content), and "," is used as the word separator

select keywords from article limit 10;
+-------------------------------------------------+
| keywords |
+-------------------------------------------------+
| NULL |
| NULL |
| ,marriage,love|
| Hairstyle, Side Parting, Makeup, Fashion |
| Xiao A, |
| , clothing matching, women, fashion |
| beautiful,female|
| Valentine's Day, Dongguan, Women |
| Skin, skin care, skin care, food nutrition, beauty, health care |
| Sanlitun, Beijing, fashion |
+-------------------------------------------------+

2. Search for a keyword without full-text indexing

A full table scan is required

select count(*) from article where keywords like '%Fashion%';
+----------+
| count(*) |
+----------+
| 163 |
+----------+
1 row in set (7.56 sec)

3. Create a full-text index for the keyword field (with, as the word segment)

Set innodb_ft_min_token_size in the my.cnf configuration file and restart the MySQL service (the minimum number of characters used as a keyword is two, and the default number of characters used as a keyword is three)

[mysqld]
innodb_ft_min_token_size=2

3.1 Setting custom stopwords (i.e. word segmentation)

USE mysql;
CREATE TABLE my_stopwords(VALUE VARCHAR(30)) ENGINE = INNODB;
INSERT INTO my_stopwords(VALUE) VALUE (',');
SET GLOBAL innodb_ft_server_stopword_table = 'mysql/my_stopwords';
SHOW GLOBAL VARIABLES WHERE Variable_name IN('innodb_ft_min_token_size','innodb_ft_server_stopword_table');
+---------------------------------+--------------------+
| Variable_name | Value |
+---------------------------------+--------------------+
| innodb_ft_min_token_size | 2 |
| innodb_ft_server_stopword_table | mysql/my_stopwords |
+---------------------------------+--------------------+

3.2 Create a full-text index

alter table article add fulltext index idx_full_keyword(keywords);
* [ ] Query OK, 0 rows affected, 1 warning (1 min 27.92 sec)
* [ ] Records: 0 Duplicates: 0 Warnings: 1

3.3 The remaining disk space must be sufficient. The original table is 4.6G, and the remaining disk space is 5.7G. Adding a full-text index will also fail.

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 7.8G 6.3G 1.2G 85% /
tmpfs 1.9G 0 1.9G 0% /dev/shm
/dev/mapper/vg_opt-lvol0
            19G 12G 5.7G 68% /datas

A temporary file of the original table size will be created 8.6K Jul 5 16:19 #sql-5250_3533.frm
 4.4G Jul 5 16:20 #sql-ib117-1768830977.ibd


alter table article add fulltext index idx_full_keyword(keywords);
ERROR 1114 (HY000): The table 'article' is full

3.4 Use the created full-text index to query the number of times a keyword appears

The query response time has been greatly improved, only 0.05s; using where keywords like '%时尚%' takes 7.56s

select count(*) from article where match(keywords) against('%时尚%');
+----------+
| count(*) |
+----------+
| 163 |
+----------+
1 row in set (0.05 sec)

3.5 If you need to fully match multiple keywords at the same time, use Boolean full-text search

Indicates the number of records that completely match "Sanlitun, Beijing" select count(*) from article where match(keywords) against('+Sanlitun, Beijing' in boolean mode);
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (0.06 sec)

Indicates the number of records matching "Sanlitun" or "Beijing" select count(*) from article where match(keywords) against('Sanlitun,Beijing');
+----------+
| count(*) |
+----------+
| 8 |
+----------+
1 row in set (0.06 sec)

3.6 After creating a full-text index, some other files will be created

96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_1.ibd
96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_2.ibd
96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_3.ibd
96K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_4.ibd
128K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_5.ibd
256K Jul 5 16:30 FTS_00000000000000a7_00000000000000c0_INDEX_6.ibd
96K Jul 5 16:29 FTS_00000000000000a7_BEING_DELETED_CACHE.ibd
96K Jul 5 16:29 FTS_00000000000000a7_BEING_DELETED.ibd
96K Jul 5 16:30 FTS_00000000000000a7_CONFIG.ibd
96K Jul 5 16:29 FTS_00000000000000a7_DELETED_CACHE.ibd
96K Jul 5 16:29 FTS_00000000000000a7_DELETED.ibd
- The first 6 represent the inverted index (auxiliary index table)
- The 7th and 8th columns represent the document ID (DOC_ID) containing the deleted document, whose data is currently being deleted from the full-text index.
- The 9th one indicates the internal status of the FULLTEXT index
- The 10th and 11th columns contain documents that have been deleted but whose data has not yet been removed from the full-text index

Creating a full-text index using the ngram parser

1. Create a full-text index for the title field (this field does not have fixed stopwords, and uses the ngram parser)

You need to set ngram_token_size in the my.cnf configuration file (the default is 2, 2 characters as the keywords of ngram), and restart the MySQL service. Here we use the default value of 2.

select title from article limit 10;
+------------------------------------------------------------------------------+
| title |
+------------------------------------------------------------------------------+
| worth IT |
|Launchpad Jiangnan Leather Factory small show |
|Rare moment behind the scenes of Raw: "Mad Man" is carried backstage|
|Raw: The son yelled at his father, "You are a green tea son." He fought four people alone.|
|Four groups of 30 square meters of finely decorated small apartments, a large number of pictures, and floor plans |
|Sexy smoky cat eye makeup by the nightclub queen|
|Big Show slams The Rock Johnson |
|Girls' Generation Choi Sooyoung's clothing knowledge Lin Yoona, Huang Miying, Kim Taeyeon, and Jung Soojung|                       
|Outdoor outing in Deyang, barbecue in the flower fields|
+------------------------------------------------------------------------------+

2. Create a full-text index for the title field

alter table article add fulltext index ft_index_title(title) with parser ngram;
Query OK, 0 rows affected (3 min 29.22 sec)
Records: 0 Duplicates: 0 Warnings: 0

3. An inverted index will be created (the longer the title field is, the larger the inverted index created)

112M Jul 5 21:46 FTS_000000000000000a7_00000000000000cd_INDEX_1.ibd
28M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_2.ibd
20M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_3.ibd
140M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_4.ibd
128M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_5.ibd
668M Jul 5 21:46 FTS_000000000000000a7_00000000000000cd_INDEX_6.ibd

4. Search for a keyword in the title without creating a full-text index

select count(*) from article where title like '%Outdoor%';
+----------+
| count(*) |
+----------+
| 22058 |
+----------+
1 row in set (8.60 sec)

select count(*) from article where title like '%Background%';
+----------+
| count(*) |
+----------+
| 1142 |
+----------+

5. Use full-text indexing to search for a keyword

Response time has been greatly improved

select count(*) from article where match(title) against('Outdoor');
+----------+
| count(*) |
+----------+
| 22058 |
+----------+
1 row in set (0.07 sec)

select count(*) from article where title like '%Background%';
+----------+
| count(*) |
+----------+
| 1142 |
+----------+
1 row in set (8.31 sec)

6. Note that when the number of characters in the search keyword is greater than 2 (ngram_token_size defines the size), inconsistency will occur

In normal search, the number of records that actually contain this keyword is 6
select count(*) from article where title like '%公子大%';
+----------+
| count(*) |
+----------+
| 6 |
+----------+
1 row in set (8.40 sec)

Full text search, the number of records with the keyword is 9443
select count(*) from article where match(title) against('公子大');
+----------+
| count(*) |
+----------+
|9443|
+----------+
1 row in set (0.06 sec)

The actual number of records that contain this keyword is 1
select count(*) from article where title like '%花田自服%';
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (8.33 sec)

The number of records that contain this keyword in the full text search is 3202
select count(*) from article where match(title) against('花田自服');
+----------+
| count(*) |
+----------+
| 3202 |
+----------+
1 row in set (0.06 sec)

in conclusion

  • When a MySQL field has a fixed stopword (space in English, "," "-" in Chinese, etc.), a full-text index is created for the field, which can quickly search for relevant records that contain a certain keyword, achieving the effect of a simple search engine.
  • When a MySQL field does not have a fixed stopword, the built-in parser ngram can be used to divide the field value into a fixed number of keywords (ngram_token_size) for fast search; when the number of characters in the searched keyword is not equal to the size defined by ngram_token_size, there will be a problem that is inconsistent with the actual situation
  • Full-text indexing allows for fast searches, but it also incurs overhead in maintaining the index. The larger the field length, the larger the full-text index created, which will affect the throughput of DML statements. You can use a dedicated full-text search engine, ES, to do this.

refer to

InnoDB FULLTEXT Indexes

Summarize

The above is the full content of this article. I hope that the content of this article will have certain reference learning value for your study or work. Thank you for your support of 123WORDPRESS.COM.

You may also be interested in:
  • Example code for implementing a simple search engine with MySQL
  • Detailed introduction to MySQL-Fulltext, a search engine based on MySQL
  • Scrapy+flask+html to create a sample code for the search engine
  • Python implements article duplication check function based on search engine
  • Python practical hand writing a search engine
  • Detailed explanation of Python large-scale search engine image crawler tool
  • 360 search engine automatic inclusion PHP rewriting solution
  • PHP implementation code for recording search engine crawling records
  • Python lossless music search engine implementation code
  • Implement a simple search engine based on MySQL

<<:  WeChat applet implementation anchor positioning function example

>>:  Ubuntu Docker installation in vmware (container building)

Recommend

Explanation of the new feature of Hadoop 2.X, the recycle bin function

By turning on the Recycle Bin function, you can r...

MySQL 8.0.12 Installation and Configuration Tutorial

This article records the detailed tutorial for in...

Vue implements the digital thousands separator format globally

This article example shares the specific code for...

How to use JavaScript and CSS correctly in XHTML documents

In more and more websites, the use of XHTML is rep...

How to configure Nginx to support ipv6 under Linux system

1. Check whether the existing nginx supports ipv6...

Summary of common commands for building ZooKeeper3.4 middleware under centos7

1. Download and decompress 1. Introduction to Zoo...

Detailed explanation of using grep command in Linux

Linux grep command The Linux grep command is used...

Why I recommend Nginx as a backend server proxy (reason analysis)

1. Introduction Our real servers should not be di...

jQuery implements ad display and hide animation

We often see ads appear after a few seconds and t...

Detailed steps to install Nginx on Linux

1. Nginx installation steps 1.1 Official website ...

Summary of 28 common JavaScript string methods and usage tips

Table of contents Preface 1. Get the length of a ...