word extraction tool

Word Extraction Tool(1): Overview of Word Extraction Tool

Published September 24, 2022 · Updated Feb 20, 2023

An overview of word extraction tools that can be useful for data standardization work, especially for creating standard word candidates.

<<Table of Contents>>

1. Word Extraction Tool Overview

1.1. Why We Developed Word Extraction Tool

Among the initial tasks of data standardization, the most difficult task is to collect as many candidates as possible and quickly to be registered as standard words. Data Standards Checking Tool (see: Data standard inspection tool_1.Overview) can be used to extract standard word candidates, but it has the following difficulties.

If database table, column comment data contains many special characters (symbols such as #, $, %, ., \, etc., and line separators, etc.), considerable effort is required to remove or refine them.
It is difficult to know the frequency of words, so it is difficult to determine whether to register only single words, only compound words, or both single words and compound words.
If a compound word is later identified after a standard word is confirmed and affects the physical name of a standard term that has already been registered, exceptions to the standard naming rule can make management difficult.

The word extraction tool was developed to alleviate some of these difficulties. In particular, we hope that it will be helpful in the following cases.

If there is no current data standard dictionary or even if the number of standard words is small
Your job is so unique that there is no data standard dictionary that is suitable for reference.
When the database table and column comments are too large and it takes a lot of time to manually extract words
Or, conversely, when there is little content in database tables and column comments, so it is inappropriate to extract standard words and it is appropriate to extract them from documents such as work manuals.
In addition, if words and frequencies are required to be extracted from documents

1.2. word extraction tool concept

The word extraction tool is a tool that receives various types of files as input, extracts words and compound words using a natural language processing morpheme analyzer, and outputs the frequency and source (file name, table name, column name, etc.) as an Excel file.

Mecab, a Korean natural language processing (NLP) morpheme analyzer, was used and developed in Python v3.8. Kkma, Komoran, Hannanum, Okt (formerly known as Twitter), and Mecab are representative libraries among Korean natural language processing morpheme analyzers. Among them, Mecab was selected because it has the best performance.

Performance comparison of natural language processing morpheme analyzers can be found in the link below.

Reference: https://konlpy.org/ko/latest/morph/#comparison-between-pos-tagging-classes

한국어 형태소 분석기 성능 비교 (출처: 참조 Link에서 발췌) — Performance comparison of Korean morpheme analyzers (Source: Excerpt from Reference Link)

The execution time according to the increase in the number of input characters can be summarized as follows. (Execution time decreases and performance improves as you go from left to right)

Kkma > Komoran > Hannanum > Okt (Twitter) > Mecab

For reference, the above link is KoNLPy packageThis is the site of the person who developed it. KoNLPyis a Python-based package that bundles several morpheme analyzers into one.

KoNLPy: https://konlpy.org/ko/latest/

1.3. How the word extractor works

Briefly look at input data, processing logic, and output data.

1.3.1. word extraction tool input material

Input data can be specified in one or both of the following two ways.

Documents: MS Word, PowerPoint, Text files
- At the time of writing this article (2021-08-29), HWP and PDF formats are not yet supported.
DB Table, Column comment Source: Excel file
- Table comment data items: Database, Schema, Table Name, Table Comment
- Column comment data items: Database, Schema, Table Name, Table Comment, Column Name, Column Comment

▼ An example of table comment data is as follows.

Database	Schema	Table Name	Table Comment
DB1	OWNER1	COMTCADMINISTCODE	administrative code
DB1	OWNER1	COMTCADMINISTCODERECPTNLOG	Administration code reception log
DB1	OWNER1	COMTCCMNCLCODE	Common classification code
DB1	OWNER1	COMTCCMNCODE	common code
DB1	OWNER1	COMTCCMNDETAILCODE	Common detail code

Table comment data example

▼ Examples of column comment data are as follows. This is the column list of COMTCADMINISTCODE (administrative code) among the table list above.

Database	Schema	Table Name	Column Name	Column Comment
DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_SE	Administrative district division
DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_CODE	Administrative district code
DB1	OWNER1	COMTCADMINISTCODE	USE_AT	Whether or not to use
DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_NM	Administrative district name
DB1	OWNER1	COMTCADMINISTCODE	UPPER_ADMINIST_ZONE_CODE	Upper Administrative District Code
DB1	OWNER1	COMTCADMINISTCODE	CREAT_DE	creation date
DB1	OWNER1	COMTCADMINISTCODE	ABL_DE	abolition date
DB1	OWNER1	COMTCADMINISTCODE	FRST_REGIST_PNTTM	First registration
DB1	OWNER1	COMTCADMINISTCODE	FRST_REGISTER_ID	Initial registrant ID
DB1	OWNER1	COMTCADMINISTCODE	LAST_UPDT_PNTTM	Last modified time
DB1	OWNER1	COMTCADMINISTCODE	LAST_UPDUSR_ID	Last Modified ID

Column comment data example

* The example data above was created using the table and column comment scripts in the “Common Component Table Configuration Information” page of the e-Government Standard Framework v3.8.

(source: https://www.egovframe.go.kr/wiki/doku.php?id=egovframework:com:v3.8:init_table)

1.3.2. Word extraction tool processing logic

A brief summary of the entire processing logic is as follows.

Extract text (line by row, table/column) by opening input data sequentially
Extraction of word candidates in the form of 1 noun, n nouns, prefix + n nouns, n nouns + suffixes, prefix + n nouns + suffixes using Mecab, a natural language morpheme analyzer package
Find the frequency of words extracted from the entire input data, and save the word extraction result as an output file
Create and save word cloud as png file with word list and frequency
Outputs the total required time and exits

A simplified diagram of the above process is as follows.

단어 추출 도구 처리 과정 — Word extraction tool process

1.3.3. Word extraction tool output data

The output data, which is the result of processing the input data, is an Excel file and an image (png) file in the form of a word cloud.

The Excel file consists of two sheets. The following is an example of DB Table and Column comment data as input.

▼ “Example of word extraction result” sheet

No	word	FileName	FileType	Page	Text	DB	Schema	table	Column
1	administration	table,column comments.xlsx	column	0	Administrative district division	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_SE
2	area	table,column comments.xlsx	column	0	Administrative district division	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_SE
3	division	table,column comments.xlsx	column	0	Administrative district division	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_SE
4	Administrative division [compound]	table,column comments.xlsx	column	0	Administrative district division	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_SE
5	administration	table,column comments.xlsx	column	0	Administrative district code	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_CODE
6	area	table,column comments.xlsx	column	0	Administrative district code	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_CODE
7	code	table,column comments.xlsx	column	0	Administrative district code	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_CODE
8	Administrative district code [compound]	table,column comments.xlsx	column	0	Administrative district code	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_CODE
9	use	table,column comments.xlsx	column	0	Whether or not to use	DB1	OWNER1	COMTCADMINISTCODE	USE_AT
10	Whether	table,column comments.xlsx	column	0	Whether or not to use	DB1	OWNER1	COMTCADMINISTCODE	USE_AT
11	Whether to use [compound word]	table,column comments.xlsx	column	0	Whether or not to use	DB1	OWNER1	COMTCADMINISTCODE	USE_AT
12	region	table,column comments.xlsx	column	0	Administrative district name	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_NM
13	station name	table,column comments.xlsx	column	0	Administrative district name	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_NM
14	Administrative district name [compound]	table,column comments.xlsx	column	0	Administrative district name	DB1	OWNER1	COMTCADMINISTCODE	ADMINIST_ZONE_NM

“Example of word extraction result” sheet

“Text” column: This is the original value extracted from the input data, and in this example, it corresponds to table, column comment.
Column “Word”: Word candidates extracted from Text using Mecab. For compound words, specify “[compound word]” as a suffix.
- Line 12 “administration district”, line 13 “station name” are words extracted from “administration name” in Mecab.
- It can be seen that the accuracy is not 100% because it is extracted differently from the word actually used.

▼ “Word frequency” sheet example

word	Freq	Source
code	110	DB1.OWNER1.COMTCADMINISTCODE.ADMINIST_ZONE_CODE (administrative zone code) DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.CHANGE_SE_CODE (change identification code) DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.CTPRVN_CODE (attempt code) …
number	103	DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.OPERT_SN(job serial number) DB1.OWNER1.COMTCZIP.ZIP (postal code) DB1.OWNER1.COMTHCONFMHISTORY.CONFM_NO (approval number) …
number of people	88	DB1.OWNER1.COMTNADBKMANAGE.ADBK_NM (address book name) DB1.OWNER1.COMTCCMMNCLCODE.CL_CODE_NM (classification code name) DB1.OWNER1.COMTCCMNDETAILCODE.CODE_NM (code name) …
Day	85	DB1.OWNER1.COMTCADMINISTCODE.CREAT_DE(creation date) DB1.OWNER1.COMTCADMINISTCODE.ABL_DE (retirement date) DB1.OWNER1.COMTCADMINISTCODERECPTNLOG.OCCRRNC_DE(occurrence date) …
information	77	DB1.OWNER1.COMTHDBMNTRNGLOGINFO.LOG_INFO (log information) DB1.OWNER1.COMTNBACKUPRESULT.ERROR_INFO (error information) DB1.OWNER1.COMTNINDVDLINFOPOLICY.INDVDL_INFO_POLICY_ID (Privacy Policy ID) …
Whether	75	DB1.OWNER1.COMTCADMINISTCODE.USE_AT (whether to use) DB1.OWNER1.COMTNANNVRSRYMANAGE.REPTIT_AT (whether to repeat) DB1.OWNER1.COMTNBANNER.REFLCT_AT (whether to reflect) …

Example of “Word Frequencies” sheet

“Word” column: This is a string value obtained by removing duplicates from the “Word” column of the “Word Extraction Result” sheet. This value is a candidate to be registered as a standard word.
Column “Freq”: This is the frequency count indicating how many times the word was used. The resulting list is sorted in reverse order from these high-frequency words to the low-frequency ones.
“Source” column: Indicates the source of the word. Displays up to 10 sources.
- If the source is Table, the format is: DB.Schema.TableName(Table comment)
- If the source is Column, the format is: DB.Schema.TableName.ColumnName(Column comment)
- If the source is File Format: Filename:PageNumber:Text

An example of a word cloud image generated by the frequency of extracted words is as follows. Words with high frequency are displayed in large size.

단어 추출 결과 Word cloud — Word extraction results Word cloud

The word extraction tool is a tool developed in Python, and requires an environment configuration process such as installing Python and necessary packages prior to execution. Next, we will look at the environment configuration process.

<< List of related articles >>

Tags: wordcloud word extraction word-extractor nlp stem analyzer python MeCab

Word Extraction Tool(1): Overview of Word Extraction Tool

1. Word Extraction Tool Overview