Word Extraction Tool(1): Overview of Word Extraction Tool

An overview of word extraction tools that can be useful for data standardization work, especially for creating standard word candidates.

1. Word Extraction Tool Overview

1.1. Why We Developed Word Extraction Tool

Among the initial tasks of data standardization, the most difficult task is to collect as many candidates as possible and quickly to be registered as standard words. Data Standards Checking Tool (see: Data standard inspection tool_1.Overview) can be used to extract standard word candidates, but it has the following difficulties.

  • If database table, column comment data contains many special characters (symbols such as #, $, %, ., \, etc., and line separators, etc.), considerable effort is required to remove or refine them.
  • It is difficult to know the frequency of words, so it is difficult to determine whether to register only single words, only compound words, or both single words and compound words.
  • If a compound word is later identified after a standard word is confirmed and affects the physical name of a standard term that has already been registered, exceptions to the standard naming rule can make management difficult.

The word extraction tool was developed to alleviate some of these difficulties. In particular, we hope that it will be helpful in the following cases.

  • If there is no current data standard dictionary or even if the number of standard words is small
  • Your job is so unique that there is no data standard dictionary that is suitable for reference.
  • When the database table and column comments are too large and it takes a lot of time to manually extract words
  • Or, conversely, when there is little content in database tables and column comments, so it is inappropriate to extract standard words and it is appropriate to extract them from documents such as work manuals.
  • In addition, if words and frequencies are required to be extracted from documents

1.2. word extraction tool concept

The word extraction tool is a tool that receives various types of files as input, extracts words and compound words using a natural language processing morpheme analyzer, and outputs the frequency and source (file name, table name, column name, etc.) as an Excel file.

Mecab, a Korean natural language processing (NLP) morpheme analyzer, was used and developed in Python v3.8. Kkma, Komoran, Hannanum, Okt (formerly known as Twitter), and Mecab are representative libraries among Korean natural language processing morpheme analyzers. Among them, Mecab was selected because it has the best performance.

Performance comparison of natural language processing morpheme analyzers can be found in the link below.

Reference: https://konlpy.org/ko/latest/morph/#comparison-between-pos-tagging-classes

한국어 형태소 분석기 성능 비교 (출처: 참조 Link에서 발췌)
Performance comparison of Korean morpheme analyzers (Source: Excerpt from Reference Link)

The execution time according to the increase in the number of input characters can be summarized as follows. (Execution time decreases and performance improves as you go from left to right)

Kkma > Komoran > Hannanum > Okt (Twitter) > Mecab

For reference, the above link is KoNLPy packageThis is the site of the person who developed it. KoNLPyis a Python-based package that bundles several morpheme analyzers into one.

KoNLPy: https://konlpy.org/ko/latest/

1.3. How the word extractor works

Briefly look at input data, processing logic, and output data.

1.3.1. word extraction tool input material

Input data can be specified in one or both of the following two ways.

  1. Documents: MS Word, PowerPoint, Text files
    • At the time of writing this article (2021-08-29), HWP and PDF formats are not yet supported.
  2. DB Table, Column comment Source: Excel file
    • Table comment data items: Database, Schema, Table Name, Table Comment
    • Column comment data items: Database, Schema, Table Name, Table Comment, Column Name, Column Comment

▼ An example of table comment data is as follows.

DatabaseSchemaTable NameTable Comment
DB1OWNER1COMTCADMINISTCODEadministrative code
DB1OWNER1COMTCADMINISTCODERECPTNLOGAdministration code reception log
DB1OWNER1COMTCCMNCLCODECommon classification code
DB1OWNER1COMTCCMNCODEcommon code
DB1OWNER1COMTCCMNDETAILCODECommon detail code
Table comment data example

▼ Examples of column comment data are as follows. This is the column list of COMTCADMINISTCODE (administrative code) among the table list above.

DatabaseSchemaTable NameColumn NameColumn Comment
DB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_SEAdministrative district division
DB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_CODEAdministrative district code
DB1OWNER1COMTCADMINISTCODEUSE_ATWhether or not to use
DB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_NMAdministrative district name
DB1OWNER1COMTCADMINISTCODEUPPER_ADMINIST_ZONE_CODEUpper Administrative District Code
DB1OWNER1COMTCADMINISTCODECREAT_DEcreation date
DB1OWNER1COMTCADMINISTCODEABL_DEabolition date
DB1OWNER1COMTCADMINISTCODEFRST_REGIST_PNTTMFirst registration
DB1OWNER1COMTCADMINISTCODEFRST_REGISTER_IDInitial registrant ID
DB1OWNER1COMTCADMINISTCODELAST_UPDT_PNTTMLast modified time
DB1OWNER1COMTCADMINISTCODELAST_UPDUSR_IDLast Modified ID
Column comment data example

* The example data above was created using the table and column comment scripts in the “Common Component Table Configuration Information” page of the e-Government Standard Framework v3.8.

(source: https://www.egovframe.go.kr/wiki/doku.php?id=egovframework:com:v3.8:init_table)

1.3.2. Word extraction tool processing logic

A brief summary of the entire processing logic is as follows.

  1. Extract text (line by row, table/column) by opening input data sequentially
  2. Extraction of word candidates in the form of 1 noun, n nouns, prefix + n nouns, n nouns + suffixes, prefix + n nouns + suffixes using Mecab, a natural language morpheme analyzer package
  3. Find the frequency of words extracted from the entire input data, and save the word extraction result as an output file
  4. Create and save word cloud as png file with word list and frequency
  5. Outputs the total required time and exits

A simplified diagram of the above process is as follows.

단어 추출 도구 처리 과정
Word extraction tool process

1.3.3. Word extraction tool output data

The output data, which is the result of processing the input data, is an Excel file and an image (png) file in the form of a word cloud.

The Excel file consists of two sheets. The following is an example of DB Table and Column comment data as input.

▼ “Example of word extraction result” sheet

NowordFileNameFileTypePageTextDBSchematableColumn
1administrationtable,column comments.xlsxcolumn0Administrative district divisionDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_SE
2areatable,column comments.xlsxcolumn0Administrative district divisionDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_SE
3divisiontable,column comments.xlsxcolumn0Administrative district divisionDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_SE
4Administrative division [compound]table,column comments.xlsxcolumn0Administrative district divisionDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_SE
5administrationtable,column comments.xlsxcolumn0Administrative district codeDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_CODE
6areatable,column comments.xlsxcolumn0Administrative district codeDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_CODE
7codetable,column comments.xlsxcolumn0Administrative district codeDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_CODE
8Administrative district code [compound]table,column comments.xlsxcolumn0Administrative district codeDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_CODE
9usetable,column comments.xlsxcolumn0Whether or not to useDB1OWNER1COMTCADMINISTCODEUSE_AT
10Whethertable,column comments.xlsxcolumn0Whether or not to useDB1OWNER1COMTCADMINISTCODEUSE_AT
11Whether to use [compound word]table,column comments.xlsxcolumn0Whether or not to useDB1OWNER1COMTCADMINISTCODEUSE_AT
12regiontable,column comments.xlsxcolumn0Administrative district nameDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_NM
13station nametable,column comments.xlsxcolumn0Administrative district nameDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_NM
14Administrative district name [compound]table,column comments.xlsxcolumn0Administrative district nameDB1OWNER1COMTCADMINISTCODEADMINIST_ZONE_NM
“Example of word extraction result” sheet
  • “Text” column: This is the original value extracted from the input data, and in this example, it corresponds to table, column comment.
  • Column “Word”: Word candidates extracted from Text using Mecab. For compound words, specify “[compound word]” as a suffix.
    • Line 12 “administration district”, line 13 “station name” are words extracted from “administration name” in Mecab.
    • It can be seen that the accuracy is not 100% because it is extracted differently from the word actually used.

▼ “Word frequency” sheet example

wordFreqSource
code110DB1.OWNER1.COMTCADMINISTCODE.ADMINIST_ZONE_CODE (administrative zone code)
DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.CHANGE_SE_CODE (change identification code)
DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.CTPRVN_CODE (attempt code)
number103DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.OPERT_SN(job serial number)
DB1.OWNER1.COMTCZIP.ZIP (postal code)
DB1.OWNER1.COMTHCONFMHISTORY.CONFM_NO (approval number)
number of people88DB1.OWNER1.COMTNADBKMANAGE.ADBK_NM (address book name)
DB1.OWNER1.COMTCCMMNCLCODE.CL_CODE_NM (classification code name)
DB1.OWNER1.COMTCCMNDETAILCODE.CODE_NM (code name)
Day85DB1.OWNER1.COMTCADMINISTCODE.CREAT_DE(creation date)
DB1.OWNER1.COMTCADMINISTCODE.ABL_DE (retirement date)
DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.OCCRRNC_DE(occurrence date)
information77DB1.OWNER1.COMTHDBMNTRNGLOGINFO.LOG_INFO (log information)
DB1.OWNER1.COMTNBACKUPRESULT.ERROR_INFO (error information)
DB1.OWNER1.COMTNINDVDLINFOPOLICY.INDVDL_INFO_POLICY_ID (Privacy Policy ID)
Whether75DB1.OWNER1.COMTCADMINISTCODE.USE_AT (whether to use)
DB1.OWNER1.COMTNANNVRSRYMANAGE.REPTIT_AT (whether to repeat)
DB1.OWNER1.COMTNBANNER.REFLCT_AT (whether to reflect)
Example of “Word Frequencies” sheet
  • “Word” column: This is a string value obtained by removing duplicates from the “Word” column of the “Word Extraction Result” sheet. This value is a candidate to be registered as a standard word.
  • Column “Freq”: This is the frequency count indicating how many times the word was used. The resulting list is sorted in reverse order from these high-frequency words to the low-frequency ones.
  • “Source” column: Indicates the source of the word. Displays up to 10 sources.
    • If the source is Table, the format is: DB.Schema.TableName(Table comment)
    • If the source is Column, the format is: DB.Schema.TableName.ColumnName(Column comment)
    • If the source is File Format: Filename:PageNumber:Text

An example of a word cloud image generated by the frequency of extracted words is as follows. Words with high frequency are displayed in large size.

단어 추출 결과 Word cloud
Word extraction results Word cloud

The word extraction tool is a tool developed in Python, and requires an environment configuration process such as installing Python and necessary packages prior to execution. Next, we will look at the environment configuration process.


<< List of related articles >>

Leave a Reply

Your email address will not be published. Required fields are marked *

en_USEnglish