MySQL extracts Json internal fields and dumps them as numbers

MySQL extracts Json internal fields and dumps them as numbers

This is just a statistic of a simple data migration. The amount of data is not large, but the trouble lies in some intermediate steps processing and consideration.

There is no content about SQL optimization and index optimization, so please be gentle.

background

The number of records in the user's ophthalmology attribute table is about 986w. The purpose is to parse eight fields of the attribute values ​​(json format) of about 29w records into numbers and dump them into records of the statistical table for chart analysis.

The following structures and data are mostly made up by me, so don't take them seriously.

The structure of the user ophthalmology attribute table is as follows

CREATE TABLE `property` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ownerId` int(11) NOT NULL COMMENT 'Record ID or template ID',
  `ownerType` tinyint(4) NOT NULL COMMENT 'Type. 0: record 1: template',
  `recorderId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'Recorder ID',
  `userId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'User ID',
  `roleId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'Role ID',
  `type` tinyint(4) NOT NULL COMMENT 'Field type. 0: Text 1: Option 2: Time 3: Image 4: ICD10 9: New Image',
  `name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Field name',
  `value` mediumtext NOT NULL COMMENT 'Field value',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idxOwnerIdOwnerTypeNameType` (`ownerType`,`ownerId`,`name`,`type`) USING BTREE,
  KEY `idxUserIdRoleIdRecorderIdName` (`userId`,`roleId`,`recorderId`,`name`)
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Attributes';

Problem Analysis

1. The attribute value is in Json format and needs to be processed using Json operation functions

Because the attribute value is in Json format, as follows. A relatively large Json, but only 8 field values ​​are needed, which are extracted and classified into different statistical indicators.

{ ......
    "sight": {
        "nakedEye": {
            "left": "0.9",
            "right": "0.6"
        },
        "correction": {
            "left": "1",
            "right": "1"
        }
    },
    ......
    "axialLength": {
        "left": "21",
        "right": "12"
    },
    "korneaRadius": {
        "left": "34",
        "right": "33"
    },
    ......
}

Therefore, you need to use the Json operation function: json_extract(value,'$.key1.key2').

But it should be noted that the value extracted by this function is enclosed in "". For example, the result of executing json_extract(value,'$.sight.nakedEye.left') on the above record is "22"; it is also possible that the field value is an empty string, in which case the result is "".

Therefore, you need to use the replace function to delete the "" in the result. The final expression for extracting the field is: replace(json_extract(value,'$.sight.nakedEye.left'),'"','').

The result is NULL if the field does not exist; either the outer sight does not exist, or the inner left does not exist.

2. The field content is not standardized and messy

Ideally, all the numbers filled in are standardized, so after the above step, they can be extracted and directly imported into the new table.

However, the reality is cruel, and the things filled in are a mess. for example:

  • Number + Notes: 1 (poor coordination), 1-\+ (I guess this is meant to indicate high or low)
  • Number + Unit: Similar to above, 1mm
  • Most values ​​or intervals: 22.52/42.45, 1-5
  • Plain text description: Uncooperative, unable to record
  • Mixed description of text and numbers: 10 more than the previous time, <1, less than 1, BD234/KD23

There is no other way but to look for product and business matches. Fortunately, there are not many, just over 4,000. You can get an idea by just taking a quick look. The following solutions were obtained:

  • Starting with a number: Data starting with a number are all correctly recorded data, and the text description can be omitted
  • Multiple values ​​or intervals: just take the first number
  • Plain text: indicates no data, so it is excluded.
  • Mixed text and numbers: Analyze each problem individually and see how much is left after removing the others.

How to do it specifically?

Step 1: Exclude normal numeric data and empty data

WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 // This can already exclude null AND `nakedEyeLeft` != ''

Step 2: If it does not contain a number, set it to NULL or an empty string

SET nakedEyeLeft = IF(nakedEyeLeft NOT regexp '[0-9]', '', nakedEyeLeft)

Step 3: Extract the first value of the data starting with a number

SET nakedEyeLeft = IF((nakedEyeLeft + 0 = 0), nakedEyeLeft, nakedEyeLeft + 0)

Combined together,

SET nakedEyeLeft = IF(nakedEyeLeft NOT regexp '[0-9]''', '', 
                      IF((nakedEyeLeft + 0 = 0), nakedEyeLeft, nakedEyeLeft + 0))
WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 // This can already exclude null AND `nakedEyeLeft` != ''

PS: The SQL for processing one field looks simple, but because 8 fields are processed at a time in batches, the combined SQL becomes very long.

Be careful not to enter the wrong field.

The last one left is the fourth category: mixed text and numbers, with more than 40 items.

Some of them look simple and can be processed automatically using regular expressions, such as <1 or less than 1.

The growth value of the record needs to be calculated by finding the last record: an increase of 10 compared to the last record.

The remaining ones are a bit complicated and need human processing to extract usable data, such as BD234/KD23

I wonder if everyone who sees this also finds this a bit troublesome?

I thought I had done it with gritted teeth, but the business said to just process it as 0. If I find out it is 0 later, I can save it again through the page.

There is no need to determine whether it starts with a number, just add + 0; if it starts with a number, the first number will be retained; otherwise = 0.

The final data format SQL is:

UPDATE property 
SET nakedEyeLeft = IF(nakedEyeLeft NOT regexp '[0-9]''', '', nakedEyeLeft + 0)
WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 // This can already exclude null AND `nakedEyeLeft` != '';

3. It takes too long to extract content and format. There are still 9 million+ records.

The property table has more than 9 million data items, but the only known conditions for the required records are name, ownerType, and type, and there is no way to match the existing index.

If you search directly, you will have to scan the entire table, extract and format data; what's more, you will also need to associate other tables and supplement some other fields of statistical indicators.

In this case, if the statistical table is imported directly, the result is that the two tables + the related table will be locked together for a long time, during which time no changes or insertions can be made, which is not very realistic.

Reduce the number of scan lines

Method 1: Add indexes to name, ownerType, and type to reduce the number of scanned records to 200,000.

But the problem is that after adding indexes to 9 million data, the indexes need to be deleted (because they are not required by business conditions), which will lead to two fluctuations;

Coupled with the subsequent processing time of locking the table, the problem is still very serious.

Method 2: Use a table with fewer records as the driving table, which can be associated with the target table.

CREATE TABLE `property` (
  `ownerId` int(11) NOT NULL COMMENT 'Record ID or template ID',
  `ownerType` tinyint(4) NOT NULL COMMENT 'Type. 0: record 1: template',
  `type` tinyint(4) NOT NULL COMMENT 'Field type. 0: Text 1: Option 2: Time 3: Image 4: ICD10 9: New Image',
  `name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Field name',
  `value` mediumtext NOT NULL COMMENT 'Field value',
    Omit other fields UNIQUE KEY `idxOwnerIdOwnerTypeNameType` (`ownerType`,`ownerId`,`name`,`type`) USING BTREE
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Attributes';

The ownerId in the table can be associated with the record table, plus the previous conditions name, ownerType, type, so it just hits ``idxOwnerIdOwnerTypeNameType (ownerType,ownerId,name,type).

CREATE TABLE `medicalrecord` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(50) NOT NULL DEFAULT '' COMMENT 'Record name',
  `type` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Record type. ',
    Omit other fields KEY `idxName` (`name`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='record';

The record table can hit the index idxName through name='ophthalmology record', and the number of scanned rows is only 20,000, plus the attribute table 290,000, and the final number of scanned rows is only about 300,000, which is 30 times less than the full table scan of the attribute table! ! ! .

Avoid table lock time for data extraction and formatting

Because there are 8 fields, each field needs to be extracted and formatted, and judgment is required in the middle. In this way, the same extraction and formatting operations in one SQL statement must be executed multiple times.

Therefore, in order to avoid such problems, an intermediate table is needed to temporarily store the extraction and formatting results.

CREATE TABLE `propertytmp` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
   `value` mediumtext NOT NULL COMMENT 'Field value',
  `nakedEyeLeft` varchar(255) DEFAULT NULL COMMENT 'Vision - Naked Eye - Left Eye',
  `nakedEyeRight` varchar(255) DEFAULT NULL COMMENT 'Vision - Naked Eye - Right Eye',
  `correctionLeft` varchar(255) DEFAULT NULL COMMENT 'Vision-Correction-Left Eye',
  `correctionRight` varchar(255) DEFAULT NULL COMMENT 'Vision-Correction-Right Eye',
  `axialLengthLeft` varchar(255) DEFAULT NULL COMMENT 'Axial length - left eye',
  `axialLengthRight` varchar(255) DEFAULT NULL COMMENT 'Axial length - right eye',
  `korneaRadiusLeft` varchar(255) DEFAULT NULL COMMENT 'Corneal curvature - left eye',
  `korneaRadiusRight` varchar(255) DEFAULT NULL COMMENT 'Corneal curvature - right eye',
  `updated` datetime NOT NULL COMMENT 'Update time',
  `deleted` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

First import the data into the table, extract it based on this, and then format it.

Comparison of final execution results

Data import comparison

Results: Full table scan, attribute table import into intermediate table (40s), attribute table new index + import (6s + 3s), association import (1.4s).

Because it needs to be associated with other tables, it is not as ideal as expected.

Intermediate table data extraction: 7.5s

UPDATE `propertytmp` 
SET nakedEyeLeft = REPLACE(json_extract(value,'$.sight.axialLength.left'),'"',''),
nakedEyeLeft = REPLACE(json_extract(value,'$.sight.nakedEye.left'),'"',''),
nakedEyeRight = REPLACE(json_extract(value,'$.sight.nakedEye.right'),'"',''),
correctionLeft = REPLACE(json_extract(value,'$.sight.correction.left'),'"',''),
correctionRight = REPLACE(json_extract(value,'$.sight.correction.right'),'"',''),
axialLengthLeft = REPLACE(json_extract(value,'$.axialLength.left'),'"',''),
axialLengthRight = REPLACE(json_extract(value,'$.axialLength.right'),'"',''),
korneaRadiusLeft = REPLACE(json_extract(value,'$.korneaRadius.left'),'"',''),
korneaRadiusRight = REPLACE(json_extract(value,'$.korneaRadius.right'),'"','');

Intermediate table data formatting: 2.3s

Regularization is faster than I thought.

UPDATE propertytmp 
SET nakedEyeLeft = IF(nakedEyeLeft NOT REGEXP '[0-9]' AND nakedEyeLeft != '', '', nakedEyeLeft + 0), 
nakedEyeRight = IF(nakedEyeRight NOT REGEXP '[0-9]' AND nakedEyeRight != '', '', nakedEyeRight + 0), 
correctionLeft = IF(correctionLeft NOT REGEXP '[0-9]' AND correctionLeft != '', '', correctionLeft + 0),
correctionRight = IF(correctionRight NOT REGEXP '[0-9]' AND correctionRight != '', '', correctionRight + 0),
axialLengthLeft = IF(axialLengthLeft NOT REGEXP '[0-9]' AND axialLengthLeft != '', '', axialLengthLeft + 0),
axialLengthRight = IF(axialLengthRight NOT REGEXP '[0-9]' AND axialLengthRight != '', '', axialLengthRight + 0),
korneaRadiusLeft = IF(korneaRadiusLeft NOT REGEXP '[0-9]' AND korneaRadiusLeft != '', '', korneaRadiusLeft + 0),
korneaRadiusRight = IF(korneaRadiusRight NOT REGEXP '[0-9]' AND korneaRadiusRight != '', '', korneaRadiusRight + 0)
WHERE (`nakedEyeLeft` REGEXP '[^0-9.]' = 1
       AND `nakedEyeLeft` != '')
  OR (`nakedEyeRight` REGEXP '[^0-9.]' = 1
      AND `nakedEyeRight` != '')
  OR (`correctionLeft` REGEXP '[^0-9.]' = 1
      AND `correctionLeft` != '')
  OR (`correctionRight` REGEXP '[^0-9.]' = 1
      AND `correctionRight` != '')
  OR (`axialLengthLeft` REGEXP '[^0-9.]' = 1
      AND `axialLengthLeft` != '')
  OR (`axialLengthRight` REGEXP '[^0-9.]' = 1
      AND `axialLengthRight` != '')
  OR (`korneaRadiusLeft` REGEXP '[^0-9.]' = 1
      AND `korneaRadiusLeft` != '')
  OR (`korneaRadiusRight` REGEXP '[^0-9.]' = 1
      AND `korneaRadiusRight` != '');

Statistical indicators intermediate table

Because when actually importing the statistical indicator table, you also need to exclude empty data and associate other tables for supplementation.

In order to reduce the impact on the indicator table, an intermediate table of the indicator table was created with the same structure and an ID auto-increment of the target table + 10000.

Import the data from the attribute intermediate table into the indicator intermediate table, and then directly INSERT ... SELECT FROM, which will be very fast.

Of course, this step is actually a bit of an overreaction, but in order to avoid some fluctuations online, it is better to be cautious.

Summarize

This is a simple record of data migration experience.

There is no content about index optimization or SQL optimization, I just think that everyone needs to pay attention to performance and consider the impact on users.

This is the end of this article about extracting MySQL Json internal fields and dumping them as numbers. For more related MySQL extracting Json dumps as numbers, please search 123WORDPRESS.COM's previous articles or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • How to map the fields in MySQL to JSON format in mybatis plus entity class
  • Example code for converting Mysql query result set into JSON data
  • How to use MySQL common functions to process JSON
  • Detailed explanation of querying JSON format fields in MySQL
  • MySQL json format data query operation
  • MySQL 8.0 can now handle JSON

<<:  CSS implements the bottom tapbar function

>>:  Div picture marquee seamless connection implementation code

Recommend

Mycli is a must-have tool for MySQL command line enthusiasts

mycli MyCLI is a command line interface for MySQL...

Analysis and practice of React server-side rendering principle

Most people have heard of the concept of server-s...

The difference between Input's size and maxlength attributes

I recently used the input size and maxlength attri...

Sample code for implementing follow ads with JavaScript

Floating ads are a very common form of advertisin...

How to modify the length limit of group_concat in Mysql

In MySQL, there is a function called "group_...

Tutorial on installing mysql5.7.18 on mac os10.12

I searched the entire web and found all kinds of ...

How to use docker+devpi to build local pypi source

Some time ago, I needed to use pip downloads freq...

Summary of common knowledge points required for MySQL

Table of contents Primary key constraint Unique p...

The concept of MTR in MySQL

MTR stands for Mini-Transaction. As the name sugg...

Solution to Linux QT Kit missing and Version empty problem

Currently encountering such a problem My situatio...

A "classic" pitfall of MySQL UPDATE statement

Table of contents 1. Problematic SQL statements S...

Continuous delivery using Jenkins and Docker under Docker

1. What is Continuous Delivery The software produ...

The whole process of installing mysql5.7.22 under ARM64 architecture

MySQL download address: https://obs.cn-north-4.my...

4 ways to modify MySQL root password (summary)

Method 1: Use the SET PASSWORD command First log ...