How to use JS to determine whether a file is UTF-8 encoded

Conventional solution

Use FileReader to read the file as UTF-8 text, then decide whether the file is UTF-8 by checking whether the decoded content contains garbled characters.

If the replacement character � (U+FFFD) appears in the content, the file is not UTF-8 encoded; otherwise it is treated as UTF-8.
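To see why � works as a signal, the decoding step can be tried on raw bytes without any file at all. The following is a minimal sketch, assuming an environment where TextEncoder and TextDecoder are available (modern browsers and Node.js); the helper name decodesCleanlyAsUtf8 is hypothetical and not part of the original code.

```typescript
// U+FFFD is the replacement character emitted for byte sequences
// that are not valid UTF-8.
const REPLACEMENT_CHAR = "\uFFFD";

// Hypothetical helper for illustration only.
function decodesCleanlyAsUtf8(bytes: Uint8Array): boolean {
  // Without { fatal: true }, invalid sequences become U+FFFD instead of throwing.
  const text = new TextDecoder("utf-8").decode(bytes);
  return text.indexOf(REPLACEMENT_CHAR) === -1;
}

// "你好" encoded as UTF-8 (three bytes per character).
const utf8Bytes = new TextEncoder().encode("你好");
// "你好" encoded as GBK (two bytes per character), written out by hand.
const gbkBytes = new Uint8Array([0xc4, 0xe3, 0xba, 0xc3]);

console.log(decodesCleanlyAsUtf8(utf8Bytes)); // true
console.log(decodesCleanlyAsUtf8(gbkBytes)); // false
```

One limitation of this signal: a valid UTF-8 file that genuinely contains the character U+FFFD would also be reported as not UTF-8.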

The FileReader-based code is as follows:

const isUtf8 = async (file: File): Promise<boolean> => {
  return await new Promise((resolve, reject) => {
    const reader = new FileReader();

    // onload fires only after a successful read, so result is a string here.
    reader.onload = (): void => {
      const content = reader.result as string;
      // U+FFFD is the replacement character produced for invalid UTF-8 bytes.
      const encodingRight = content.indexOf("\uFFFD") === -1;

      if (encodingRight) {
        resolve(encodingRight);
      } else {
        reject(new Error("Encoding format error, please upload UTF-8 format file"));
      }
    };

    reader.onerror = () => {
      reject(new Error("File content reading failed, please check if the file is damaged"));
    };

    // With no encoding argument, readAsText decodes as UTF-8 by default.
    reader.readAsText(file);
  });
};

The problem with this method is that the browser loads the entire file content into memory. If the file is very large, such as several GB, the FileReader instance typically fires onerror, and sometimes the browser crashes outright.

Large file solution

For large files, sample the content instead of reading all of it. Here the file is divided into 100 parts, and only the first segment of each part is read as a string (described here as 1 KB; the listing that follows uses a 1 MB sample, and reads files under 50 MB in full). If the cut falls in the middle of a multi-byte character, such as a Chinese character, � may appear at the beginning or end of the decoded string even though the file is valid UTF-8, and the segment would be wrongly treated as non-UTF-8. To avoid this, skip the first few characters, take roughly the first half of the decoded string, and check only that portion for �.

The above constants can be adjusted according to requirements.
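The boundary effect described above can be reproduced on a small byte array. This is a sketch under stated assumptions: it uses TextEncoder and TextDecoder rather than FileReader so it can run outside a browser, and the sample text and cut positions are made up for illustration. The final slice uses the same offsets as the source's check (skip 4 characters, then take about half).

```typescript
const REPLACEMENT = "\uFFFD";

// 40 Chinese characters, 3 bytes each in UTF-8: 120 bytes in total.
const allBytes = new TextEncoder().encode("汉字".repeat(20));

// Simulate a sample whose edges fall inside a character:
// drop the first byte of the first character and the last byte of the last one.
const cutBytes = allBytes.slice(1, 119);
const decoded = new TextDecoder("utf-8").decode(cutBytes);

// The damaged edges decode to U+FFFD even though the source text is valid UTF-8.
const wholeHasReplacement = decoded.indexOf(REPLACEMENT) !== -1;

// Skip the first 4 characters and keep roughly half of the string.
const middle = decoded.slice(4, 4 + Math.floor(decoded.length / 2));
const middleHasReplacement = middle.indexOf(REPLACEMENT) !== -1;

console.log(wholeHasReplacement); // true
console.log(middleHasReplacement); // false
```

A UTF-8 character is at most 4 bytes long, so a cut can leave at most 3 orphaned bytes at the start of a sample. Skipping 4 characters is therefore enough to step over the damaged edge.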

The sampling code is as follows:

const getSamples = (file: File) => {
  const filesize = file.size;
  const parts: Blob[] = [];
  // Files under 50 MB are checked as a single part.
  if (filesize < 50 * 1024 * 1024) {
    parts.push(file);
  } else {
    const total = 100; // number of samples
    const sampleSize = 1024 * 1024; // bytes read from the start of each part
    const chunkSize = Math.floor(filesize / total);
    // Take one sample from the start of each of the `total` parts.
    for (let i = 0; i < total; i++) {
      const start = i * chunkSize;
      parts.push(file.slice(start, start + sampleSize));
    }
  }
  return parts;
};

const isUtf8 = (filePart: Blob) => {
  return new Promise<void>((resolve, reject) => {
    const fileReader = new FileReader();

    fileReader.readAsText(filePart);

    fileReader.onload = (e) => {
      const str = e.target?.result as string;
      // Skip the first 4 characters and take roughly half of the string,
      // so a character cut at the sample boundary is not counted.
      const sampleStr = str.slice(4, 4 + Math.floor(str.length / 2));
      if (sampleStr.indexOf("\uFFFD") === -1) {
        resolve();
      } else {
        reject(new Error("Encoding format error, please upload UTF-8 format file"));
      }
    };

    fileReader.onerror = () => {
      reject(new Error("File content reading failed, please check if the file is damaged"));
    };
  });
};

export default async function (file: File) {
  const samples = getSamples(file);
  let res = true;

  for (const filePart of samples) {
    try {
      await isUtf8(filePart);
    } catch (error) {
      res = false;
      break;
    }
  }
  return res;
}
