Data standard check tool_4.Attachment

This is the appendix of the data standard inspection tool. You can check the processing of the number suffix of the property name, standard word dictionary simulation method, reference information of properties subject to standard inspection, license, and VBA code. This is a continuation of the previous article.

Data Standard Check Tool_3. Standard Check Case Results

4. Attachment of data standard inspection tool

4.1. Processing of numeric suffix in attribute name

If there is a number suffix (postfix number) in the property name, the number is excluded and the standard term dictionary or standard word dictionary is searched. A numeric suffix is added after the result of the found standard term or standard word combination.

속성명의 숫자 Suffix에 대한 처리
Processing of numeric suffix in attribute name

In the figure above, the physical name of the attribute “personal targeting 1” is created as “PERSN_TGT_YN1”. The attribute name terminator was normally identified as “whether or not”.

The processing logic for the numeric suffix of the property name is as follows.

  • Find and store only the consecutive numbers (a) from the back of the property name to be checked, delete the numbers, and inspect
  • As a result of inspection, the number a stored in “standard word logical name combination” and “standard word physical name combination” is appended to the right

4.2. Standard word dictionary simulation method

In order to check whether the word combination for the input property name succeeds when the word physical name is not determined, that is, whether it can be registered as a standard term, a logical name is registered in the “Standard Word Dictionary” sheet, and the physical name is “(” + logical It can be simulated by composing name + “)”. In this case, the “standard word physical name combination” item in the standard check result is displayed in green.

표준단어사전 Simulation 방법
Standard word dictionary simulation method

If you click the “Add to Word Dictionary” button in the standard check result, the physical name of each word is created as “(” + logical name + “)” for all “additional candidate words” and added to the word dictionary. Words that have already been registered are not added.

Incidentally, if you want to automatically generate an abbreviation for the physical name of a word, you can use the user function “Generate Abbreviation” included in this VBA project. The source code of the function is as follows.

Public Function 약어생성(aFullName As String, Optional aMaxSize = 4) As String
    Dim sFullName As String, sAbbName As String
    Dim i As Integer
    
    '영문 Full Name 중에 포함된 '-' 는 공백으로 변경하여 처리한다.
    sFullName = Trim(Replace(aFullName, "-", " "))
    
    '/* 영문 Full Name이 숫자로 시작하는 경우 단어 순서를 변경한다. (첫번째, 두번째 단어 순서만 변경) */
    Select Case Mid(sFullName, 1, 1)
        Case "0" To "9"
            sFullName = Trim(단어순서변경(sFullName))
    End Select

    sFullName = StrConv(sFullName, vbProperCase)
    sFullName = Replace(sFullName, "A ", "")
    sFullName = Replace(sFullName, "An ", "")
    sFullName = Replace(sFullName, "The ", "")

    If InStr(1, sFullName, " ") = 0 Then '/* 영문FullName이 하나의 단어일 경우*/
        If Len(sFullName) <= 4 Then '/* 영문FullName의 길이가 4 이하 일 경우 */
            sAbbName = sFullName
        Else
            '/* 영문FullName의 4자 이후 단어들이 모두 모음인 경우 - 앞 4자리 단어를 축약어로 사용 */
            If 모음제거(Mid(sFullName, 5)) = "" Then
                sAbbName = Mid(sFullName, 1, 4)
            Else
                sAbbName = Mid(sFullName, 1, 1) + 모음제거(Mid(sFullName, 2))   '-- 모음시작 단어 첫모음 살림
                '/* 축약 결과 길이제한 : 최대 aMaxSize 자리까지 */
                If Len(sAbbName) > aMaxSize Then
                    sAbbName = Mid(sAbbName, 1, aMaxSize)
                End If
            End If

        End If
    Else '/* 영문FullName이 여러 단어로 구성되어 있을 경우 */
        Dim iBlankCnt As Integer
        iBlankCnt = Len(sFullName) - Len(Replace(sFullName, " ", ""))

        '각 단어 첫글자
        sAbbName = sFullName

        Dim iPos As Integer, sChar As String, sAbbTmp As String
        If iBlankCnt = 1 Then '/* 단어수가 2개인 경우 - 각 단어의 시작 2글자 조합으로 생성 */
            sAbbName = 모음제거(sAbbName)
            i = InStr(1, sAbbName, " ") + 1
            sAbbName = UCase(Left(sAbbName, 2) + Mid(sAbbName, i, 2))
        ElseIf iBlankCnt <= 4 Then '/* 단어수가 4개 이하인 경우 - 첫글자 조합으로 생성 */
             For iPos = 1 To Len(sAbbName)
                sChar = Mid(sAbbName, iPos, 1)
                If (sChar >= "A" And sChar <= "Z") Or (sChar >= "0" And sChar <= "9") Then
                    sAbbTmp = sAbbTmp + sChar
                End If
                If Len(sAbbTmp) >= aMaxSize Then GoTo Exit_For1
             Next iPos
Exit_For1:
             sAbbName = sAbbTmp
        Else
            '/* 단어의 길이가 최대 허용자리수 이하인 경우 - 그대로 사용 */
            If Len(Replace(sFullName, " ", "")) <= aMaxSize Then
                sAbbName = UCase(Replace(sFullName, " ", ""))
            Else
                Dim sInText As String
                sInText = 등위접속사정리(UCase(sFullName))
                '/* 등위접속사가 존재하면 첫글자들로 구성 */
                If sInText <> sFullName Then
                    sAbbName = 모음제거(sInText)
                    For iPos = 1 To Len(sAbbName)
                        sChar = Mid(sAbbName, iPos, 1)
                        If (sChar >= "A" And sChar <= "Z") Or (sChar >= "0" And sChar <= "9") Then
                            sAbbTmp = sAbbTmp + sChar
                        End If
                        If Len(sAbbTmp) >= aMaxSize Then GoTo Exit_For2
                    Next iPos
Exit_For2:
                    sAbbName = sAbbTmp
                Else
                    '/* 두단어로 구성된 경우, 각각의 단어에서 앞 2자리 문자를 가지고 약어 생성 */
                    Dim iUnitLen As Integer, aWordArray() As String, sWord As String
                    iUnitLen = 2
                    aWordArray = Split(sFullName)
                    For iPos = 0 To UBound(aWordArray)
                        sWord = aWordArray(iPos)
                        If Len(sWord) = iUnitLen Then
                            sAbbTmp = sAbbTmp + sWord
                        Else
                            If 모음제거(Mid(sWord, iUnitLen + 1)) = "" Then
                                sAbbTmp = sAbbTmp + Mid(sWord, 1, iUnitLen)
                            Else
                                sAbbTmp = sAbbTmp + Mid(Mid(sWord, 1, 1) + 모음제거(이중자음정리(Mid(sWord, 2))), 1, iUnitLen)
                            End If
                        End If
                    Next
                    sAbbName = sAbbTmp
                End If
            End If

        End If

    End If
    약어생성 = UCase(sAbbName)
End Function

4.3. Description of reference information of standard inspection target properties

Property names are changed or data types/lengths are changed during standardization. Items to manage reference information such as property name before change, column name, data type/length, or notes during standardization work are needed.

On the right side of the “Property name standardization” sheet standard inspection results, items are created to manage additional information on standard inspection results and additional reference information required for each property name. For reference information, the user can add necessary items.

표준점검대상 속성의 참조정보
Reference information of properties subject to standard inspection

▼ Standards Maintenance Additional Information

  • Physical Name Length: The length of the physical name created as a result of standard inspection. Displays the background color of cells exceeding the maximum length (e.g. 26) in red in conditional formatting according to the physical name maximum length option setting.
  • Number of words: Number of combined words of physical names created as a result of standard inspection
  • _Number: The number of '_' characters in the physical name created as a result of standard inspection
  • Attribute Name CNT: The number of entered attribute names (if there is a duplicated attribute name, that is, if the attribute name CNT is 2 or more, the background color of the cell is displayed in red)

▼ Additional reference information

  • Property name before change: The property name initially entered before standard inspection of property names.
    • It is not automatically set, but after creating a list of property names, the user directly sets by copying and pasting.
    • It is used to check what the property name was originally entered when changing the property name while executing the standard check several times, and it is recommended to use it whenever possible.
  • Property name after change: Current property name (automatically set as a formula)
  • As-Is property name: As-Is property name of the entered property name (required for As-Is tracking)
  • As-Is column name: As-Is column name of the entered property name (required for As-Is tracking)
  • As-Is DataType: As-Is Data Type of the entered property name (required for As-Is tracking)
  • Whether or not the property name has been changed: Compare the property name before and after the change, and mark “O” if it has changed
  • Items to be confirmed: Freely record and utilize items that need to be checked during standard inspection work
  • Others (add as needed): Add a column to the right of “Necessary to check” to manage required items

4.4. license

There are no restrictions on users who need and want to use this standard inspection tool. The content of the license was written briefly by applying the BSD-2 license.

  • Author: Youngmin Lee
  • email: leeyoungmin@gmail.com
  • blog: https://prodskill.com/
  • github: https://github.com/DAToolset
  • License Content: Basically follows the BSD-2 license specified in the github repository. If not specified, follow the instructions below
    • 1. Representation of the copyright holder: The name of the copyright holder, email address, blog address, and github repository address must be used without change. In case of changing the source code and using it, you can add content related to the secondary author.
    • 2. Warranty Disclaimer: The copyright holder cannot be held liable for any problems caused by using this source code and tools.
    • If the above conditions are satisfied, it is permitted for all purposes, such as personal use such as learning/utilization, use of this tool in business at work, or use of the tool to be included in a commercial product by changing the source code.
    • This license content cannot be removed.
  • tell me
    • You do not have to disclose the source code of the derivative work. However, if it is not a problem, I would like you to let me know which secondary work you used it for. You can send it by email or leave it as a blog comment.
    • Please also notify us of errors or improvement requests in the tool's functionality, errors in the guide documentation, or requests for improvement by e-mail or blog comments.

For your reference, the license of the open tool is announced in the blog post below.

https://prodtool.tistory.com/notice/19

#ToDo: Change Link

4.5. Organize your VBA code

WorkSheet, UserForm, Module, Class list. The VBA code of this tool is composed of Module and Class to improve reusability and maintainability. According to the MVC design pattern, M (Model) is Class, V (View) is Excel UI, and C (Control) is Module was applied.

VBA 코드 구성
VBA code configuration

▼ Worksheet Object

  • Program Sheet: shtRun (property name standardization): Sheet that manages the function and property list for property name standard inspection
  • Data Sheet: Sheet referenced by the program: standard word dictionary, standard term dictionary, standard domain dictionary

▼ Form Object

  • frmAbout: “About… ” Form that is displayed when clicking the button. Contains instructional text

▼ Module Object

  • modControlStd: Functions related to standard inspection, such as creating standard inspection results and adding candidate words
  • modTest: Test code
  • modUtil: Common functions (logging, formatting, suffix removal, etc.)
  • Mod abbreviation generation: Create an abbreviation by inputting the English full name of the word and the maximum number of digits of the abbreviation

▼ Class Module Object

  • CDBConnectionUnit: Class that provides functions such as DB connection, SQL execution and result creation to update standard words/terms/domains from standard Repository DB
  • CStdDomain: Class of one standard domain unit
  • CStdDomainDic
    • Class for searching the standard domain list as a Dictionary structure (Key-Value)
    • Key: domain classification name
    • Value: Domain Collection of the corresponding domain taxonomy
  • CStdTerm: class of one standard term
  • CStdTermDic
    • Class for searching standard term list as Dictionary structure (Key-Value)
    • Key: term logic name
    • Value: an instance of the CStdTerm class
    • Read the standard terminology dictionary and load it into memory as a dictionary structure
  • CStdWord: Class of one standard word
  • CStdWordCol: Class for managing a list of homophones for a standard word
  • CStdWordDic
    • Class for searching standard word list as Dictionary structure (Key-Value)
    • Key: word logical name
    • Value: instance collection of CStdWord class (CStdWordCol object)
    • Internally manages additional dictionary with physical word name as key
    • In the process of reading the standard word dictionary and loading it into memory as a dictionary, duplicate logical names and duplicate physical names are checked.

4.6. Main VBA Class Description and Overall Logic Flow Summary

The main classes are “CStdWordDic” that manages the standard word dictionary, “CStdTermDic” that manages the standard term dictionary, and “CStdDomainDic” that manages the standard domain dictionary. The entire logic is handled in the “standard check” procedure of the “modControlStd” module.

In summary, it is as follows.

4.6.1. CStdWordDic Class

CStdWordDic 구성
Configure CStdWordDic

CStdWordDic class manages the list of each standard word by using the member variable of Dictionary type. Key is word name (word logic name), Value has one word object (CStdWord type) in general case. When duplicates are allowed for logical names of words, that is, when homonyms are allowed, Value is a Collection object that holds multiple word objects (CStdWord type). Internally, it additionally manages a dictionary with the physical name of the word as the key and the word object as the value.

In order to determine whether there is a homophone, whether to classify by the type of the Value object, the Value object to be the same as a Collection, and to distinguish whether the item count of the Collection is 1 or 2 or more, It was classified by type.

So, in the “standard check” procedure of the “modControlStd” module, the following type of code appears in several places.

If TypeOf oStdWordObj Is CStdWord Then
    '논리명이 유일한 경우
    Set oStdWord = oStdWordObj
    sToken = oStdWord.m_s단어논리명
    ...
ElseIf TypeOf oStdWordObj Is CStdWordCol Then
    '동음이의어가 있어 논리명이 2개 이상인 경우
    b동음이의어Matched = True
    ...
End If

Please refer to the following about the Dictionary type.

  • The Dictionary type of Excel VBA has a HashMap structure composed of Key and Value.
  • It is not a basic type, and it is used after referring to the “Microsoft Scripting Runtime” library, or declared as an object type and created with CreateObject (“Scripting.Dictionary”) to create an instance.

4.6.2. CStdTermDic Class

CStdTermDic 구성
Configure CStdTermDic

CStdTermDic class manages the list of each standard term using Dictionary type member variable. Key is the logical name of the term, and Value is the term entity (CStdTerm type). It is used for the logic that determines whether term units match during property standard inspection.

4.6.3. CStdDomainDic Class

CStdDomainDic 구성
Configure CStdDomainDic

CStdDomainDic class manages the list of each standard domain using Dictionary type member variable. Key is domain classification name, Value is domain collection. The items of the domain collection are CStdDomain type domain objects.

4.6.4. Overall logic flow summary

전체 로직 흐름 요약
Overall logic flow summary

The entire logic is implemented in the "Standard Check" procedure of the module modControl. A brief summary of the process is as follows.

  1. Read data from each standard word/term/domain dictionary sheet and create variables of each class type. It is implemented in the Load() procedure of each class type.
  2. Reads the list of properties to be checked and stores them in Variant array.
  3. For each property to be checked, it is searched using the standard dictionary key-value structure, and the inspection result is stored in a Variant type array. When processing is finished, the inspection results stored in Variant type are output on the sheet.

This concludes the description of the data standard checking tool. If you have any questions, please leave them in the comments.


The data standards check tool can be found in the github repository below.

https://github.com/DAToolset/ToolsForDataStandard


<< List of related articles >>

Leave a Reply

Your email address will not be published. Required fields are marked *

en_USEnglish