MySQL extracts Json internal fields and dumps them as numbers

background
Problem Analysis

1. The attribute value is in Json format and needs to be processed using Json operation functions
2. The field content is not standardized and messy
3. It takes too long to extract content and format. There are still 9 million+ records.

Comparison of final execution results

Data import comparison

Summarize

This is just a statistic of a simple data migration. The amount of data is not large, but the trouble lies in some intermediate steps processing and consideration.

There is no content about SQL optimization and index optimization, so please be gentle.

background

The number of records in the user's ophthalmology attribute table is about 986w. The purpose is to parse eight fields of the attribute values (json format) of about 29w records into numbers and dump them into records of the statistical table for chart analysis.

The following structures and data are mostly made up by me, so don't take them seriously.

The structure of the user ophthalmology attribute table is as follows

CREATE TABLE `property` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ownerId` int(11) NOT NULL COMMENT 'Record ID or template ID',
  `ownerType` tinyint(4) NOT NULL COMMENT 'Type. 0: record 1: template',
  `recorderId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'Recorder ID',
  `userId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'User ID',
  `roleId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'Role ID',
  `type` tinyint(4) NOT NULL COMMENT 'Field type. 0: Text 1: Option 2: Time 3: Image 4: ICD10 9: New Image',
  `name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Field name',
  `value` mediumtext NOT NULL COMMENT 'Field value',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idxOwnerIdOwnerTypeNameType` (`ownerType`,`ownerId`,`name`,`type`) USING BTREE,
  KEY `idxUserIdRoleIdRecorderIdName` (`userId`,`roleId`,`recorderId`,`name`)
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Attributes';

Problem Analysis

1. The attribute value is in Json format and needs to be processed using Json operation functions

Because the attribute value is in Json format, as follows. A relatively large Json, but only 8 field values are needed, which are extracted and classified into different statistical indicators.

{ ......
    "sight": {
        "nakedEye": {
            "left": "0.9",
            "right": "0.6"
        },
        "correction": {
            "left": "1",
            "right": "1"
        }
    },
    ......
    "axialLength": {
        "left": "21",
        "right": "12"
    },
    "korneaRadius": {
        "left": "34",
        "right": "33"
    },
    ......
}

Therefore, you need to use the Json operation function: json_extract(value,'$.key1.key2').

But it should be noted that the value extracted by this function is enclosed in "". For example, the result of executing json_extract(value,'$.sight.nakedEye.left') on the above record is "22"; it is also possible that the field value is an empty string, in which case the result is "".

Therefore, you need to use the replace function to delete the "" in the result. The final expression for extracting the field is: replace(json_extract(value,'$.sight.nakedEye.left'),'"','').

The result is NULL if the field does not exist; either the outer sight does not exist, or the inner left does not exist.

2. The field content is not standardized and messy

Ideally, all the numbers filled in are standardized, so after the above step, they can be extracted and directly imported into the new table.

However, the reality is cruel, and the things filled in are a mess. for example:

Number + Notes: 1 (poor coordination), 1-\+ (I guess this is meant to indicate high or low)
Number + Unit: Similar to above, 1mm
Most values or intervals: 22.52/42.45, 1-5
Plain text description: Uncooperative, unable to record
Mixed description of text and numbers: 10 more than the previous time, <1, less than 1, BD234/KD23

There is no other way but to look for product and business matches. Fortunately, there are not many, just over 4,000. You can get an idea by just taking a quick look. The following solutions were obtained:

Starting with a number: Data starting with a number are all correctly recorded data, and the text description can be omitted
Multiple values or intervals: just take the first number
Plain text: indicates no data, so it is excluded.
Mixed text and numbers: Analyze each problem individually and see how much is left after removing the others.

How to do it specifically?

Step 1: Exclude normal numeric data and empty data

WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 // This can already exclude null AND `nakedEyeLeft` != ''

Step 2: If it does not contain a number, set it to NULL or an empty string

SET nakedEyeLeft = IF(nakedEyeLeft NOT regexp '[0-9]', '', nakedEyeLeft)

Step 3: Extract the first value of the data starting with a number

SET nakedEyeLeft = IF((nakedEyeLeft + 0 = 0), nakedEyeLeft, nakedEyeLeft + 0)

Combined together,

SET nakedEyeLeft = IF(nakedEyeLeft NOT regexp '[0-9]''', '', 
                      IF((nakedEyeLeft + 0 = 0), nakedEyeLeft, nakedEyeLeft + 0))
WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 // This can already exclude null AND `nakedEyeLeft` != ''

PS: The SQL for processing one field looks simple, but because 8 fields are processed at a time in batches, the combined SQL becomes very long.

Be careful not to enter the wrong field.

The last one left is the fourth category: mixed text and numbers, with more than 40 items.

Some of them look simple and can be processed automatically using regular expressions, such as <1 or less than 1.

The growth value of the record needs to be calculated by finding the last record: an increase of 10 compared to the last record.

The remaining ones are a bit complicated and need human processing to extract usable data, such as BD234/KD23

I wonder if everyone who sees this also finds this a bit troublesome?

I thought I had done it with gritted teeth, but the business said to just process it as 0. If I find out it is 0 later, I can save it again through the page.

There is no need to determine whether it starts with a number, just add + 0; if it starts with a number, the first number will be retained; otherwise = 0.

The final data format SQL is:

UPDATE property 
SET nakedEyeLeft = IF(nakedEyeLeft NOT regexp '[0-9]''', '', nakedEyeLeft + 0)
WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 // This can already exclude null AND `nakedEyeLeft` != '';

3. It takes too long to extract content and format. There are still 9 million+ records.

The property table has more than 9 million data items, but the only known conditions for the required records are name, ownerType, and type, and there is no way to match the existing index.

If you search directly, you will have to scan the entire table, extract and format data; what's more, you will also need to associate other tables and supplement some other fields of statistical indicators.

In this case, if the statistical table is imported directly, the result is that the two tables + the related table will be locked together for a long time, during which time no changes or insertions can be made, which is not very realistic.

Reduce the number of scan lines

Method 1: Add indexes to name, ownerType, and type to reduce the number of scanned records to 200,000.

But the problem is that after adding indexes to 9 million data, the indexes need to be deleted (because they are not required by business conditions), which will lead to two fluctuations;

Coupled with the subsequent processing time of locking the table, the problem is still very serious.

Method 2: Use a table with fewer records as the driving table, which can be associated with the target table.

CREATE TABLE `property` (
  `ownerId` int(11) NOT NULL COMMENT 'Record ID or template ID',
  `ownerType` tinyint(4) NOT NULL COMMENT 'Type. 0: record 1: template',
  `type` tinyint(4) NOT NULL COMMENT 'Field type. 0: Text 1: Option 2: Time 3: Image 4: ICD10 9: New Image',
  `name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Field name',
  `value` mediumtext NOT NULL COMMENT 'Field value',
    Omit other fields UNIQUE KEY `idxOwnerIdOwnerTypeNameType` (`ownerType`,`ownerId`,`name`,`type`) USING BTREE
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Attributes';

The ownerId in the table can be associated with the record table, plus the previous conditions name, ownerType, type, so it just hits ``idxOwnerIdOwnerTypeNameType (ownerType,ownerId,name,type).

CREATE TABLE `medicalrecord` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(50) NOT NULL DEFAULT '' COMMENT 'Record name',
  `type` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Record type. ',
    Omit other fields KEY `idxName` (`name`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='record';

The record table can hit the index idxName through name='ophthalmology record', and the number of scanned rows is only 20,000, plus the attribute table 290,000, and the final number of scanned rows is only about 300,000, which is 30 times less than the full table scan of the attribute table! ! ! .

Avoid table lock time for data extraction and formatting

Because there are 8 fields, each field needs to be extracted and formatted, and judgment is required in the middle. In this way, the same extraction and formatting operations in one SQL statement must be executed multiple times.

Therefore, in order to avoid such problems, an intermediate table is needed to temporarily store the extraction and formatting results.

CREATE TABLE `propertytmp` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
   `value` mediumtext NOT NULL COMMENT 'Field value',
  `nakedEyeLeft` varchar(255) DEFAULT NULL COMMENT 'Vision - Naked Eye - Left Eye',
  `nakedEyeRight` varchar(255) DEFAULT NULL COMMENT 'Vision - Naked Eye - Right Eye',
  `correctionLeft` varchar(255) DEFAULT NULL COMMENT 'Vision-Correction-Left Eye',
  `correctionRight` varchar(255) DEFAULT NULL COMMENT 'Vision-Correction-Right Eye',
  `axialLengthLeft` varchar(255) DEFAULT NULL COMMENT 'Axial length - left eye',
  `axialLengthRight` varchar(255) DEFAULT NULL COMMENT 'Axial length - right eye',
  `korneaRadiusLeft` varchar(255) DEFAULT NULL COMMENT 'Corneal curvature - left eye',
  `korneaRadiusRight` varchar(255) DEFAULT NULL COMMENT 'Corneal curvature - right eye',
  `updated` datetime NOT NULL COMMENT 'Update time',
  `deleted` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

First import the data into the table, extract it based on this, and then format it.

Comparison of final execution results

Data import comparison

Results: Full table scan, attribute table import into intermediate table (40s), attribute table new index + import (6s + 3s), association import (1.4s).

Because it needs to be associated with other tables, it is not as ideal as expected.

Intermediate table data extraction: 7.5s

UPDATE `propertytmp` 
SET nakedEyeLeft = REPLACE(json_extract(value,'$.sight.axialLength.left'),'"',''),
nakedEyeLeft = REPLACE(json_extract(value,'$.sight.nakedEye.left'),'"',''),
nakedEyeRight = REPLACE(json_extract(value,'$.sight.nakedEye.right'),'"',''),
correctionLeft = REPLACE(json_extract(value,'$.sight.correction.left'),'"',''),
correctionRight = REPLACE(json_extract(value,'$.sight.correction.right'),'"',''),
axialLengthLeft = REPLACE(json_extract(value,'$.axialLength.left'),'"',''),
axialLengthRight = REPLACE(json_extract(value,'$.axialLength.right'),'"',''),
korneaRadiusLeft = REPLACE(json_extract(value,'$.korneaRadius.left'),'"',''),
korneaRadiusRight = REPLACE(json_extract(value,'$.korneaRadius.right'),'"','');

Intermediate table data formatting: 2.3s

Regularization is faster than I thought.

UPDATE propertytmp 
SET nakedEyeLeft = IF(nakedEyeLeft NOT REGEXP '[0-9]' AND nakedEyeLeft != '', '', nakedEyeLeft + 0), 
nakedEyeRight = IF(nakedEyeRight NOT REGEXP '[0-9]' AND nakedEyeRight != '', '', nakedEyeRight + 0), 
correctionLeft = IF(correctionLeft NOT REGEXP '[0-9]' AND correctionLeft != '', '', correctionLeft + 0),
correctionRight = IF(correctionRight NOT REGEXP '[0-9]' AND correctionRight != '', '', correctionRight + 0),
axialLengthLeft = IF(axialLengthLeft NOT REGEXP '[0-9]' AND axialLengthLeft != '', '', axialLengthLeft + 0),
axialLengthRight = IF(axialLengthRight NOT REGEXP '[0-9]' AND axialLengthRight != '', '', axialLengthRight + 0),
korneaRadiusLeft = IF(korneaRadiusLeft NOT REGEXP '[0-9]' AND korneaRadiusLeft != '', '', korneaRadiusLeft + 0),
korneaRadiusRight = IF(korneaRadiusRight NOT REGEXP '[0-9]' AND korneaRadiusRight != '', '', korneaRadiusRight + 0)
WHERE (`nakedEyeLeft` REGEXP '[^0-9.]' = 1
       AND `nakedEyeLeft` != '')
  OR (`nakedEyeRight` REGEXP '[^0-9.]' = 1
      AND `nakedEyeRight` != '')
  OR (`correctionLeft` REGEXP '[^0-9.]' = 1
      AND `correctionLeft` != '')
  OR (`correctionRight` REGEXP '[^0-9.]' = 1
      AND `correctionRight` != '')
  OR (`axialLengthLeft` REGEXP '[^0-9.]' = 1
      AND `axialLengthLeft` != '')
  OR (`axialLengthRight` REGEXP '[^0-9.]' = 1
      AND `axialLengthRight` != '')
  OR (`korneaRadiusLeft` REGEXP '[^0-9.]' = 1
      AND `korneaRadiusLeft` != '')
  OR (`korneaRadiusRight` REGEXP '[^0-9.]' = 1
      AND `korneaRadiusRight` != '');

Statistical indicators intermediate table

Because when actually importing the statistical indicator table, you also need to exclude empty data and associate other tables for supplementation.

In order to reduce the impact on the indicator table, an intermediate table of the indicator table was created with the same structure and an ID auto-increment of the target table + 10000.

Import the data from the attribute intermediate table into the indicator intermediate table, and then directly INSERT ... SELECT FROM, which will be very fast.

Of course, this step is actually a bit of an overreaction, but in order to avoid some fluctuations online, it is better to be cautious.

Summarize

This is a simple record of data migration experience.

There is no content about index optimization or SQL optimization, I just think that everyone needs to pay attention to performance and consider the impact on users.

This is the end of this article about extracting MySQL Json internal fields and dumping them as numbers. For more related MySQL extracting Json dumps as numbers, please search 123WORDPRESS.COM's previous articles or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:

How to map the fields in MySQL to JSON format in mybatis plus entity class
Example code for converting Mysql query result set into JSON data
How to use MySQL common functions to process JSON
Detailed explanation of querying JSON format fields in MySQL
MySQL json format data query operation
MySQL 8.0 can now handle JSON

<<: CSS implements the bottom tapbar function

>>: Div picture marquee seamless connection implementation code

Share 10 of the latest web front-end frameworks (translation)

In-depth analysis of the reason why the error "The server quit without updating PID file" is reported when MySQL is started

Blog

MySQL extracts Json internal fields and dumps them as numbers

Table of contents

background

Problem Analysis

1. The attribute value is in Json format and needs to be processed using Json operation functions

2. The field content is not standardized and messy

3. It takes too long to extract content and format. There are still 9 million+ records.

Comparison of final execution results

Data import comparison

Summarize

Share 10 of the latest web front-end frameworks (translation)

Application of dynamic image loading technology and use case of jquery.lazyload plug-in

Implementation of Nginx filtering access logs of static resource files

Detailed tutorial on minimizing the installation of CentOS 8.1 virtual machine in VirtualBox

How to set a dotted border in html

jQuery+h5 realizes the nine-square lottery special effect (front-end and back-end code)

How to install mysql via yum on centos7

CSS menu button animation

Implementation of crawler Scrapy image created by dockerfile based on alpine

In-depth analysis of the reason why the error "The server quit without updating PID file" is reported when MySQL is started

Recommend

Quick solution for forgetting MySQL8 password

Example code for implementing beautiful clock animation effects with CSS

More Ways to Use Angle Brackets in Bash

Why is IE6 used by the most people?

CentOS 7 installation and configuration tutorial under VMware10

Causes and solutions for MySQL deadlock

Examples of 4 methods for inserting large amounts of data in MySQL

Detailed installation tutorial of mysql 5.7 under CentOS 6 and 7

VMWare Linux MySQL 5.7.13 installation and configuration tutorial

How to view and execute historical commands in Linux

Centos7 installation of Nginx integrated Lua sample code

Teach you how to deploy zabbix service on saltstack

jQuery plugin to implement minesweeper game (2)

503 service unavailable error solution explanation

How to configure two-way certificate verification on nginx proxy server