File encoding and code page recognition
#1
Hello everyone!

This is a question about text file encodings and code pages.

I need to process a lot of text files whose encodings are not uniform: they may be ANSI, UTF-8, UTF-16, GB2312...

The file in the example is encoded as GB2312 (code page 936).

In PowerShell I have to specify the encoding when reading a file, otherwise the text comes out garbled, so I added code to the PowerShell script that detects the text encoding.

Suppose that, after reading the file, I need to do a replacement: replace "测试" with "正式".

Finally, I need to save the file in its original encoding.

In QM, the text must be converted to UTF-8 before doing this, otherwise the replacement fails.
But I don't know how to program encoding and code page detection.

Here is the PowerShell code. How can I implement similar text file encoding and code page detection in QM?

Thanks in advance for any advice and help.
david
 
Code:
$codes = @'
public static class GuessCoder
{
    public static string Detect(string file)
    {
        byte[] data=System.IO.File.ReadAllBytes(file);
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE){return "Unicode";}
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF){return "UTF-16BE";}
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF){
            return "UTF-8";
        }else{
            int charByteCounter = 1;
            byte curByte;
            for (int i = 0; i < data.Length; i++)
            {
                curByte = data[i];
                if (charByteCounter == 1)
                {
                    if (curByte >= 0x80)
                    {
                        while (((curByte <<= 1) & 0x80) != 0)
                        {
                            charByteCounter++;
                        }
                        if (charByteCounter == 1 || charByteCounter > 6)
                        {
                            return "GB2312";
                        }
                    }
                }
                else
                {
                    if ((curByte & 0xC0) != 0x80)
                    {
                        return "GB2312";
                    }
                    charByteCounter--;
                }
            }
            if (charByteCounter > 1)
            {
               return "GB2312";
            }
            return "UTF-8";
        }
    }
}
'@;
Add-Type -TypeDefinition $codes


$file_in = "$HOME\Desktop\Test.txt"
$file_ok = "$HOME\Desktop\Test_ok.txt"

$checkenc = [GuessCoder]::Detect($file_in)
$checkenc

$enc = [Text.Encoding]::GetEncoding($checkenc)
$enc

$text = [IO.File]::ReadAllText($file_in, $enc)

$text = $text -replace '测试','正式'

[IO.File]::WriteAllText($file_ok, $text, $enc)
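
A note on the code page side of the question: the three BOM checks in Detect map directly onto Windows code page numbers (UTF-8 is 65001, UTF-16 LE is 1200, UTF-16 BE is 1201), which is what functions that take a numeric code page expect. Below is a minimal C sketch of that mapping, assuming the first bytes of the file are already in memory. The names `bom_code_page` and `BOM_CP_*` are made up for illustration and are not part of QM or .NET.

```c
#include <assert.h>
#include <stddef.h>

/* Windows code page identifiers for the encodings that have a BOM. */
enum {
    BOM_CP_NONE    = 0,     /* no BOM found; the content must be inspected */
    BOM_CP_UTF16LE = 1200,
    BOM_CP_UTF16BE = 1201,
    BOM_CP_UTF8    = 65001
};

/* Hypothetical helper: returns the code page implied by a leading BOM,
   or BOM_CP_NONE when the buffer does not start with one. */
int bom_code_page(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    /* The UTF-8 BOM is three bytes, so three bytes must be available. */
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BOM_CP_UTF8;
    /* The UTF-16 BOMs are two bytes. */
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BOM_CP_UTF16LE;
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BOM_CP_UTF16BE;
    return BOM_CP_NONE;
}
```

A return value of 0 only means there is no BOM. GB2312 files and BOM-less UTF-8 files both end up there, so their content still has to be inspected.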


Attached Files
.zip   Test.zip (Size: 182 bytes / Downloads: 137)
#2
I see that the GuessCoder class is written in C#, and the PowerShell code uses .NET. In that case it is better to use the new program. It is very similar to QM, but its script language is C#. You would not need to learn the QM language and convert the class, and it is much easier to convert PowerShell to C# than to QM.

C# code:
// script ""
var file_in = folders.Desktop + @"Test.txt";
var file_ok = folders.Desktop + @"Test_ok.txt";

var checkenc = GuessCoder.Detect(file_in);
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var enc = Encoding.GetEncoding(checkenc);
print.it(checkenc, enc);

var text = File.ReadAllText(file_in, enc);

text = text.Replace("测试", "正式");

File.WriteAllText(file_ok, text, enc);

public static class GuessCoder
{
    public static string Detect(string file)
    {
        byte[] data=System.IO.File.ReadAllBytes(file);
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE){return "Unicode";}
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF){return "UTF-16BE";}
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF){
            return "UTF-8";
        }else{
            int charByteCounter = 1;
            byte curByte;
            for (int i = 0; i < data.Length; i++)
            {
                curByte = data[i];
                if (charByteCounter == 1)
                {
                    if (curByte >= 0x80)
                    {
                        while (((curByte <<= 1) & 0x80) != 0)
                        {
                            charByteCounter++;
                        }
                        if (charByteCounter == 1 || charByteCounter > 6)
                        {
                            return "GB2312";
                        }
                    }
                }
                else
                {
                    if ((curByte & 0xC0) != 0x80)
                    {
                        return "GB2312";
                    }
                    charByteCounter--;
                }
            }
            if (charByteCounter > 1)
            {
               return "GB2312";
            }
            return "UTF-8";
        }
    }
}
#3
QM code looks more concise and easier to me; my C# programming level is not very good :)

I looked up some examples and got the code below.

How do I get the code page of the text encoding in QM?

I also found some C code that detects the encoding.

Macro Macro12
Code:
_s.getfile("$desktop$\Test.txt") ;;cp gb2312

;Todo: Gets the code page for the text encoding

_s.ConvertEncoding(936 65001) ;;gb2312 to utf8
_s.findreplace("测试" "正式") ;;replace
_s.ConvertEncoding(65001 936) ;;UTF8 to gb2312
_s.setfile("$desktop$\Test_ok.txt")


C code for code page
Code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h> //bool, true, false when compiled as C

bool is_str_utf8(const char* str);
bool is_str_gbk(const char* str);

//Judge if it is UTF-8
bool is_str_utf8(const char* str)
{
    unsigned int nBytes = 0; //Bytes still expected; this check accepts the legacy 1-6 byte forms, ASCII uses one byte
    unsigned char chr;
    for (unsigned int i = 0; str[i] != '\0'; ++i) {
        chr = (unsigned char)str[i];
        if (nBytes == 0) {
            //ASCII is encoded in 7 bits and the highest bit is 0 (0xxxxxxx).
            //Any other byte here must be the lead byte of a multibyte character; it gives the number of bytes
            if (chr >= 0x80) {
                if (chr >= 0xFE) {
                    return false; //0xFE and 0xFF are never valid in UTF-8
                }
                else if (chr >= 0xFC) {
                    nBytes = 6;
                }
                else if (chr >= 0xF8) {
                    nBytes = 5;
                }
                else if (chr >= 0xF0) {
                    nBytes = 4;
                }
                else if (chr >= 0xE0) {
                    nBytes = 3;
                }
                else if (chr >= 0xC0) {
                    nBytes = 2;
                }
                else {
                    return false; //10xxxxxx cannot start a character
                }
                nBytes--;
            }
        }
        else {
            //The non-first bytes of a multibyte character must be 10xxxxxx
            if ((chr & 0xC0) != 0x80) {
                return false;
            }
            //Reduce to zero
            nBytes--;
        }
    }
    //A sequence cut off at the end violates the UTF-8 encoding rules
    if (nBytes != 0) {
        return false;
    }
    //Pure ASCII text gets here too; it is also valid UTF-8
    return true;
}

//Judge if it is GB2312 (the byte ranges tested are those of GBK, a superset of GB2312)
bool is_str_gbk(const char* str)
{
    unsigned int nBytes = 0; //Bytes still expected; Chinese characters use two bytes, English (ASCII) one
    unsigned char chr;
    for (unsigned int i = 0; str[i] != '\0'; ++i) {
        chr = (unsigned char)str[i];
        if (nBytes == 0) {
            if (chr >= 0x80) {
                //The lead byte must be in 0x81-0xFE
                if (chr >= 0x81 && chr <= 0xFE) {
                    nBytes = 2;
                }
                else {
                    return false;
                }
                nBytes--;
            }
        }
        else {
            //The second byte must be in 0x40-0xFE, except 0x7F
            if (chr < 0x40 || chr > 0xFE || chr == 0x7F) {
                return false;
            }
            nBytes--;
        }
    }
    if (nBytes != 0) { //A lead byte without its second byte violates the rules
        return false;
    }
    //Pure ASCII text gets here too; it is also valid GB2312
    return true;
}

//Read the whole file and test its encoding
void read_text(const char* file_name)
{
    //Binary mode, so that the bytes are not changed while reading
    FILE *file = fopen(file_name, "rb");
    if (!file)
        return;
    //Get the file size
    fseek(file, 0, SEEK_END);
    long size = ftell(file);
    fseek(file, 0, SEEK_SET);
    if (size < 0) {
        fclose(file);
        return;
    }
    //One extra byte for the terminating '\0'
    char *data = malloc((size_t)size + 1);
    if (!data) {
        fclose(file);
        return;
    }
    size_t n = fread(data, 1, (size_t)size, file);
    data[n] = '\0';
    fclose(file);
    printf("%s\n", data);
    //Test all of the text, not only the last word that was read
    printf("%d\n", is_str_utf8(data));
    printf("%d\n", is_str_gbk(data));
    free(data);
}

//Main function testing
int main() {
    read_text("test.txt");
    return 0;
}
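
The two checks above can be combined into one function that returns a code page number, which is the missing "Todo" step in the macro. Here is a rough sketch in C under a few assumptions: the whole file is already in a buffer, there is no BOM, and only UTF-8 (65001) and GBK/GB2312 (936) are candidates. The names `guess_code_page`, `looks_utf8` and `looks_gbk` are hypothetical, and the UTF-8 check only tests lead and continuation byte patterns, so it is not a full validator. UTF-8 is tested first because UTF-8 text often also passes the GBK byte-range test: the UTF-8 bytes of "测试" (E6 B5 8B E8 AF 95) pair up as valid GBK lead/trail bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if every byte sequence has a valid UTF-8 lead byte followed by
   the right number of 10xxxxxx continuation bytes. */
static bool looks_utf8(const unsigned char *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        size_t extra; /* continuation bytes expected after the lead byte */
        if (c < 0x80)                    extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;               /* stray continuation or invalid lead */
        if (extra > len - i - 1)
            return false;                /* sequence cut off at the end */
        for (size_t k = 1; k <= extra; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                return false;
        i += extra + 1;
    }
    return true;
}

/* True if the bytes fit the GBK layout: ASCII, or a lead byte 0x81-0xFE
   followed by a trail byte 0x40-0xFE other than 0x7F. */
static bool looks_gbk(const unsigned char *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        if (c < 0x80) { i++; continue; }
        if (c < 0x81 || c > 0xFE) return false;
        if (i + 1 >= len) return false;  /* lead byte without a trail byte */
        unsigned char t = p[i + 1];
        if (t < 0x40 || t > 0xFE || t == 0x7F) return false;
        i += 2;
    }
    return true;
}

/* Returns 65001 (UTF-8), 936 (GBK/GB2312) or 0 when neither check passes. */
int guess_code_page(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (looks_utf8(data, len)) return 65001;
    if (looks_gbk(data, len))  return 936;
    return 0;
}
```

A result of 0 means neither check passed. What to fall back to in that case (for example the system ANSI code page) is a separate decision.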
#4
Macro Macro3198
Code:
str path.expandpath("$desktop$\Test.txt")
_s.getfile(path) ;;cp gb2312

;Gets the code page for the text encoding
int codePage = CsFunc("" path)
out codePage

_s.ConvertEncoding(codePage 65001) ;;detected code page to UTF-8
_s.findreplace("测试" "正式") ;;replace
_s.ConvertEncoding(65001 codePage) ;;UTF-8 back to the detected code page
_s.setfile("$desktop$\Test_ok.txt")



#ret
public static class GuessCoder
{
;;;;public static int DetectCP(string file) {
;;;;;return System.Text.Encoding.GetEncoding(Detect(file)).CodePage;
;;;;}

;;;;public static string Detect(string file)
;;;;{
;;;;;;;;byte[] data=System.IO.File.ReadAllBytes(file);
;;;;;;;;if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE){return "Unicode";}
;;;;;;;;if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF){return "UTF-16BE";}
;;;;;;;;if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF){
;;;;;;;;;;;;return "UTF-8";
;;;;;;;;}else{
;;;;;;;;;;;;int charByteCounter = 1;
;;;;;;;;;;;;byte curByte;
;;;;;;;;;;;;for (int i = 0; i < data.Length; i++)
;;;;;;;;;;;;{
;;;;;;;;;;;;;;;;curByte = data[i];
;;;;;;;;;;;;;;;;if (charByteCounter == 1)
;;;;;;;;;;;;;;;;{
;;;;;;;;;;;;;;;;;;;;if (curByte >= 0x80)
;;;;;;;;;;;;;;;;;;;;{
;;;;;;;;;;;;;;;;;;;;;;;;while (((curByte <<= 1) & 0x80) != 0)
;;;;;;;;;;;;;;;;;;;;;;;;{
;;;;;;;;;;;;;;;;;;;;;;;;;;;;charByteCounter++;
;;;;;;;;;;;;;;;;;;;;;;;;}
;;;;;;;;;;;;;;;;;;;;;;;;if (charByteCounter == 1 || charByteCounter > 6)
;;;;;;;;;;;;;;;;;;;;;;;;{
;;;;;;;;;;;;;;;;;;;;;;;;;;;;return "GB2312";
;;;;;;;;;;;;;;;;;;;;;;;;}
;;;;;;;;;;;;;;;;;;;;}
;;;;;;;;;;;;;;;;}
;;;;;;;;;;;;;;;;else
;;;;;;;;;;;;;;;;{
;;;;;;;;;;;;;;;;;;;;if ((curByte & 0xC0) != 0x80)
;;;;;;;;;;;;;;;;;;;;{
;;;;;;;;;;;;;;;;;;;;;;;;return "GB2312";
;;;;;;;;;;;;;;;;;;;;}
;;;;;;;;;;;;;;;;;;;;charByteCounter--;
;;;;;;;;;;;;;;;;}
;;;;;;;;;;;;}
;;;;;;;;;;;;if (charByteCounter > 1)
;;;;;;;;;;;;{
;;;;;;;;;;;;;;;return "GB2312";
;;;;;;;;;;;;}
;;;;;;;;;;;;return "UTF-8";
;;;;;;;;}
;;;;}
}
#5
@Gintaras
Thanks for your help, it works well.

I have a question.
The example above has only one file. If I need to process two thousand files, each about 2 MB in size, roughly how big is the speed difference between using the C# function and using pure QM code?
What do you suggest? Thanks again.
As far as I know, PowerShell is slow :)
#6
The fastest is pure C# with the new program.
This code with the C# Detect function is fast too.
The slowest would be the Detect function converted to QM.

To measure code speed, use PerfX functions.
Example 1:
Macro Macro3210
Code:
PerfFirst
0.01; ;;code example 1
PerfNext
0.02; ;;code example 2
PerfNext
PerfOut
Example 2:
Macro Macro3198
Code:
PerfFirst
rep 3
,int codePage = CsFunc("" path)
,PerfNext
PerfOut
out codePage

To make this code faster when calling the C# function many times, replace the CsFunc line with:
Code:
CsScript c.AddCode("") ;;once
int codePage = c.Call("DetectCP" path) ;;for each file
#7
Thanks a lot

