A function to get the number of UTF-8 bytes and characters in a Sybase ASE string
Let's take a look at the functions to find the number of UTF-8 bytes and characters in Sybase ASE DBMS. This function was created and used in a project that converts EUC-KSC, a Sybase ASE Non-Unicode encoding, to UTF-8, a Unicode encoding.
* UTF-8 encoding Note: UTF-8 – Wikipedia, the everyone's encyclopedia (wikipedia.org)
When converting a column declared as varchar(10) in EUC-KSC to UTF-8, it is necessary to check the maximum number of bytes and characters.
When converting to UTF-8, the maximum number of bytes and characters are necessary to determine how much to declare the maximum length of a string column. This function was used to calculate the maximum value by calculating the number of bytes and characters converted to UTF-8 encoding for the value stored in EUC-KSC encoding, which is not the declared length.
The DBMS was tested on Sybase ASE v15.5. I haven't tested it on other versions, but I think it should work.
Sybase ASE is very similar to MS-SQL, so you can use it without any modification in MS-SQL.
1. UTF-8 byte count function
IF OBJECT_ID('dbo.GET_UTF8_BYTE') IS NOT NULL DROP FUNCTION dbo.GET_UTF8_BYTE GO CREATE FUNCTION dbo.GET_UTF8_BYTE ( @IN_VAL VARCHAR(4000) ) RETURNS INT AS BEGIN DECLARE @CHK_LENGTH INT, @I INT, @CHK_BYTE INT, @BYTE_LEN INT, @UTF8_BYTE_LEN INT, @SUM_UTF8_BYTE_LEN INT SELECT @I = 1, @CHK_LENGTH = LEN(@IN_VAL), @SUM_UTF8_BYTE_LEN = 0 WHILE @I <= @CHK_LENGTH BEGIN SELECT @CHK_BYTE = ASCII(SUBSTRING(@IN_VAL, @I, 1)) IF @CHK_BYTE >= 128 -- ASCII 코드가 alpha, numeric, control 문자 범위를 넘는 경우(2 Byte 문자일 경우) SELECT @BYTE_LEN = 2, @UTF8_BYTE_LEN = 3 ELSE SELECT @BYTE_LEN = 1, @UTF8_BYTE_LEN = 1 SELECT @SUM_UTF8_BYTE_LEN = @SUM_UTF8_BYTE_LEN + @UTF8_BYTE_LEN, @I = @I + @BYTE_LEN END RETURN @SUM_UTF8_BYTE_LEN END GO
This code was written based on the following facts.
- If ASCII code of Non-Unicode 1 Byte is 128 or higher, 2-byte character (Korean, Chinese, Japanese, etc.)
- Non-Unicode Hangul 1 character (2 bytes) is stored as 3 bytes in UTF-8 encoding
The explanation of the main code is as follows.
- Line 18: Read the input string byte by byte.
- Line 21: Set length of bytes to skip if ASCII code of bytes read is greater than or equal to 128, increase number of UTF-8 bytes by 3
- Line 23: If ASCII code of bytes read is less than 128, set length of bytes to skip to 1, increment number of UTF-8 bytes by 1
- Line 25: Accumulate the sum of UTF-8 bytes, increment the value of the condition variable @I in the WHILE loop by the length of bytes to skip
2. Function to find the number of UTF-8 characters
The above function can be modified as follows to obtain the number of characters rather than the number of UTF-8 bytes.
IF OBJECT_ID('dbo.GET_CHAR_CNT') IS NOT NULL DROP FUNCTION dbo.GET_CHAR_CNT GO CREATE FUNCTION dbo.GET_CHAR_CNT ( @IN_VAL VARCHAR(4000) ) RETURNS INT AS BEGIN DECLARE @CHK_LENGTH INT, @I INT, @CHK_BYTE INT, @BYTE_LEN INT, @UTF8_BYTE_LEN INT, @SUM_UTF8_BYTE_LEN INT, @SUM_CHAR_CNT INT SELECT @I = 1, @CHK_LENGTH = LEN(@IN_VAL), @SUM_UTF8_BYTE_LEN = 0, @SUM_CHAR_CNT = 0 WHILE @I <= @CHK_LENGTH BEGIN SELECT @CHK_BYTE = ASCII(SUBSTRING(@IN_VAL, @I, 1)) IF @CHK_BYTE >= 128 -- ASCII 코드가 alpha, numeric, control 문자 범위를 넘는 경우(2 Byte 문자일 경우) SELECT @BYTE_LEN = 2, @UTF8_BYTE_LEN = 3 ELSE SELECT @BYTE_LEN = 1, @UTF8_BYTE_LEN = 1 SELECT @SUM_UTF8_BYTE_LEN = @SUM_UTF8_BYTE_LEN + @UTF8_BYTE_LEN, @I = @I + @BYTE_LEN SELECT @SUM_CHAR_CNT = @SUM_CHAR_CNT + 1 END --RETURN @SUM_UTF8_BYTE_LEN RETURN @SUM_CHAR_CNT END
The WHILE loop starting from line 16 ends with the @I variable as a condition. The @BYTE_LEN variable is set to 2 (double-byte character) or 1 (single-byte character) depending on the ASCII code of the byte read, and is added to @I in line 25. With this logic, the WHILE loop statement is repeatedly executed as many times as the number of characters.
The reason why @SUM_CHAR_CNT is incremented by 1 in line 26 is that the number of repetitions is the number of characters.