A function to get the number of UTF-8 bytes and characters in a Sybase ASE string

Let's take a look at the functions to find the number of UTF-8 bytes and characters in Sybase ASE DBMS. This function was created and used in a project that converts EUC-KSC, a Sybase ASE Non-Unicode encoding, to UTF-8, a Unicode encoding.

* UTF-8 encoding Note: UTF-8 – Wikipedia, the everyone's encyclopedia (wikipedia.org)

When converting a column declared as varchar(10) in EUC-KSC to UTF-8, it is necessary to check the maximum number of bytes and characters.

When converting to UTF-8, the maximum number of bytes and characters are necessary to determine how much to declare the maximum length of a string column. This function was used to calculate the maximum value by calculating the number of bytes and characters converted to UTF-8 encoding for the value stored in EUC-KSC encoding, which is not the declared length.

The DBMS was tested on Sybase ASE v15.5. I haven't tested it on other versions, but I think it should work.

Sybase ASE is very similar to MS-SQL, so you can use it without any modification in MS-SQL.

1. UTF-8 byte count function

IF OBJECT_ID('dbo.GET_UTF8_BYTE') IS NOT NULL
  DROP FUNCTION dbo.GET_UTF8_BYTE
GO

CREATE FUNCTION dbo.GET_UTF8_BYTE
(
    @IN_VAL VARCHAR(4000)
)
RETURNS INT
AS
BEGIN
  DECLARE @CHK_LENGTH INT, @I INT, @CHK_BYTE INT, @BYTE_LEN INT, @UTF8_BYTE_LEN INT, @SUM_UTF8_BYTE_LEN INT

  SELECT @I = 1, @CHK_LENGTH = LEN(@IN_VAL), @SUM_UTF8_BYTE_LEN = 0

  WHILE @I <= @CHK_LENGTH
  BEGIN
    SELECT @CHK_BYTE = ASCII(SUBSTRING(@IN_VAL, @I, 1))

    IF @CHK_BYTE >= 128 -- ASCII 코드가 alpha, numeric, control 문자 범위를 넘는 경우(2 Byte 문자일 경우)
        SELECT @BYTE_LEN = 2, @UTF8_BYTE_LEN = 3
    ELSE
        SELECT @BYTE_LEN = 1, @UTF8_BYTE_LEN = 1

    SELECT @SUM_UTF8_BYTE_LEN = @SUM_UTF8_BYTE_LEN + @UTF8_BYTE_LEN, @I = @I + @BYTE_LEN
  END

  RETURN @SUM_UTF8_BYTE_LEN
END
GO

This code was written based on the following facts.

  • If ASCII code of Non-Unicode 1 Byte is 128 or higher, 2-byte character (Korean, Chinese, Japanese, etc.)
  • Non-Unicode Hangul 1 character (2 bytes) is stored as 3 bytes in UTF-8 encoding

The explanation of the main code is as follows.

  • Line 18: Read the input string byte by byte.
  • Line 21: Set length of bytes to skip if ASCII code of bytes read is greater than or equal to 128, increase number of UTF-8 bytes by 3
  • Line 23: If ASCII code of bytes read is less than 128, set length of bytes to skip to 1, increment number of UTF-8 bytes by 1
  • Line 25: Accumulate the sum of UTF-8 bytes, increment the value of the condition variable @I in the WHILE loop by the length of bytes to skip

2. Function to find the number of UTF-8 characters

The above function can be modified as follows to obtain the number of characters rather than the number of UTF-8 bytes.

IF OBJECT_ID('dbo.GET_CHAR_CNT') IS NOT NULL
  DROP FUNCTION dbo.GET_CHAR_CNT
GO

CREATE FUNCTION dbo.GET_CHAR_CNT
(
    @IN_VAL VARCHAR(4000)
)
RETURNS INT
AS
BEGIN
  DECLARE @CHK_LENGTH INT, @I INT, @CHK_BYTE INT, @BYTE_LEN INT, @UTF8_BYTE_LEN INT, @SUM_UTF8_BYTE_LEN INT, @SUM_CHAR_CNT INT

  SELECT @I = 1, @CHK_LENGTH = LEN(@IN_VAL), @SUM_UTF8_BYTE_LEN = 0, @SUM_CHAR_CNT = 0

  WHILE @I <= @CHK_LENGTH
  BEGIN
    SELECT @CHK_BYTE = ASCII(SUBSTRING(@IN_VAL, @I, 1))

    IF @CHK_BYTE >= 128 -- ASCII 코드가 alpha, numeric, control 문자 범위를 넘는 경우(2 Byte 문자일 경우)
        SELECT @BYTE_LEN = 2, @UTF8_BYTE_LEN = 3
    ELSE
        SELECT @BYTE_LEN = 1, @UTF8_BYTE_LEN = 1

    SELECT @SUM_UTF8_BYTE_LEN = @SUM_UTF8_BYTE_LEN + @UTF8_BYTE_LEN, @I = @I + @BYTE_LEN
    SELECT @SUM_CHAR_CNT = @SUM_CHAR_CNT + 1
  END

  --RETURN @SUM_UTF8_BYTE_LEN
  RETURN @SUM_CHAR_CNT
END

The WHILE loop starting from line 16 ends with the @I variable as a condition. The @BYTE_LEN variable is set to 2 (double-byte character) or 1 (single-byte character) depending on the ASCII code of the byte read, and is added to @I in line 25. With this logic, the WHILE loop statement is repeatedly executed as many times as the number of characters.

The reason why @SUM_CHAR_CNT is incremented by 1 in line 26 is that the number of repetitions is the number of characters.

Leave a Reply

Your email address will not be published. Required fields are marked *

en_USEnglish