File encoding and code page recognition
Hello everyone!

This is a question about text file encodings and code pages.

I need to process a lot of text files whose encodings are not uniform: they may be ANSI, UTF-8, UTF-16, GB2312, and so on.

In the example, the encoding is GB2312 (code page 936).

In PowerShell, I have to specify the encoding when reading a file; otherwise the text comes out garbled. So my PowerShell code includes a routine that detects the text encoding.

Suppose that, after reading the file, I need to do a replacement: replace "测试" with "正式".

Finally, I need to save the file in its original encoding.
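
To make the read, replace, and save-back steps concrete, here is a minimal sketch of the round trip. It is written in Python only as an illustration, not in QM, and the function name `replace_preserving_encoding` is hypothetical. It assumes the encoding is already known (GB2312 here).

```python
def replace_preserving_encoding(raw: bytes, encoding: str, old: str, new: str) -> bytes:
    """Decode raw bytes, replace text, and re-encode with the same encoding."""
    text = raw.decode(encoding)          # bytes -> Unicode text
    text = text.replace(old, new)        # the replacement is done on Unicode text
    return text.encode(encoding)         # Unicode text -> bytes in the original encoding


# Sample data standing in for the contents of a GB2312 file.
original = "这是测试文件".encode("gb2312")
updated = replace_preserving_encoding(original, "gb2312", "测试", "正式")
print(updated.decode("gb2312"))
```

The point of the sketch is that the replacement always happens on decoded text, and the bytes are converted back to the original encoding only at the very end.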

In QM, to perform the same operation, the text must first be converted to UTF-8; otherwise the replacement cannot be completed.
But I don't know how to do encoding- and code-page-related programming.

Here is the PowerShell code. How can I implement similar text file encoding and code page detection in QM?

Thanks in advance for any advice and help.
david
 
Code:
 
$codes = @'
public static class GuessCoder
{
    public static string Detect(string file)
    {
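        byte[] data=System.IO.File.ReadAllBytes(file);
        // A byte order mark (BOM), if present, identifies the encoding directly.
        // The UTF-16 BOMs are 2 bytes long and the UTF-8 BOM is 3 bytes long.
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE){return "Unicode";}
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF){return "UTF-16BE";}
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF){
            return "UTF-8";
        }else{
            // No BOM: check whether the bytes form valid UTF-8 sequences.
            // Anything that does not is assumed to be GB2312.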
            int charByteCounter = 1;
            byte curByte;
            for (int i = 0; i < data.Length; i++)
            {
                curByte = data[i];
                if (charByteCounter == 1)
                {
                    if (curByte >= 0x80)
                    {
                        // Count the leading 1 bits of the lead byte: this is the sequence length in bytes.
                        while (((curByte <<= 1) & 0x80) != 0)
                        {
                            charByteCounter++;
                        }
                        if (charByteCounter == 1 || charByteCounter > 6)
                        {
                            return "GB2312";
                        }
                    }
                }
                else
                {
                    // Every continuation byte must have the form 10xxxxxx.
                    if ((curByte & 0xC0) != 0x80)
                    {
                        return "GB2312";
                    }
                    charByteCounter--;
                }
            }
            if (charByteCounter > 1)
            {
               return "GB2312";
            }
            return "UTF-8";
        }
    }
}
'@;
Add-Type -TypeDefinition $codes


$file_in = "$HOME\Desktop\Test.txt"
$file_ok = "$HOME\Desktop\Test_ok.txt"

$checkenc = [GuessCoder]::Detect($file_in)
$checkenc

$enc = [Text.Encoding]::GetEncoding($checkenc)
$enc

$text = [IO.File]::ReadAllText($file_in, $enc)

$text = $text -replace '测试','正式'

[IO.File]::WriteAllText($file_ok, $text, $enc)
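
For comparison, the detection logic above (BOM check first, then a UTF-8 validity test, then a fallback) can be sketched in a few lines. This is only an illustration in Python, not QM code. The function name `guess_encoding` is hypothetical, the fallback to GBK (the codec Python uses for Windows code page 936) is an assumption that only suits files known to be Chinese, and Python's strict UTF-8 decoder rejects some sequences that the hand-written scan accepts.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess a Python codec name for data: BOM first, then UTF-8 validity."""
    # A BOM, if present, identifies the encoding directly.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"    # decoding strips the BOM, encoding writes it back
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"    # the BOM stays in the text as U+FEFF, so it round-trips
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM: if the bytes decode as strict UTF-8, treat them as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: any other file is in code page 936 (GBK, a superset of GB2312).
        return "gbk"


print(guess_encoding("测试".encode("gbk")))
print(guess_encoding("测试".encode("utf-8")))
```

Like the original, this is a heuristic: a short legacy-encoded file can happen to be valid UTF-8, and pure ASCII files are reported as UTF-8.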


Attached Files
.zip   Test.zip (Size: 182 bytes / Downloads: 144)


Messages In This Thread
File encoding and code page recognition - by Davider - 07-21-2022, 02:12 PM
