Word Extraction Tool (6): Additional Description of Word Extraction Tool
Continuing from the previous article, this article covers additional notes on the word extraction tool.
Word Extraction Tool(5): Word Extraction Tool Source Code Description(2)
5. Additional explanation of word extraction tool
5.1. Why use OLE Automation?
Wikipedia defines OLE Automation as follows.
In Microsoft Windows application programming, OLE Automation (later renamed simply Automation[1][2]) is an interprocess communication (IPC) mechanism developed by Microsoft. It is based on a subset of the Component Object Model (COM) and was intended to be used through a scripting language (originally Visual Basic), but is now available through several languages on Windows.
source: https://en.wikipedia.org/wiki/OLE_Automation
In Python, OLE Automation is available through the win32com module of the pywin32 package. With it, you can control MS-Office applications to perform the functions you need.
OLE Automation was used for the word extraction tool for the following reasons.
- Dedicated packages exist that can read and write MS-Word, PowerPoint, and Excel files, but they were deliberately not used.
- MS-Word: python-docx, python-docx2txt
- PowerPoint: python-pptx
- Excel: openpyxl, xlsxwriter, pyxlsb
- Most corporate environments mandate the installation of DRM software, so document files are stored encrypted.
- A dedicated package cannot read encrypted files.
- With OLE Automation through the pywin32 package, the file is opened by the Office program itself, so it can be read.
- OLE Automation costs some performance, but it reads the files reliably.
For the Python code that controls MS-Word, PowerPoint, and Excel, please refer to the following sections of the previous article.
- MS-Word Automation: 4.3.1. get_doc_text function
- PowerPoint Automation: 4.3.2. get_ppt_text function
- Excel Automation: 4.3.4. get_db_comment_text function
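The tool's actual code is in the sections listed above. Purely as an illustration of the OLE Automation approach, a minimal sketch of reading the text of a Word document through the Word application might look like the following. The function name read_word_text and its dispatch parameter are hypothetical and are not taken from the tool; the sketch assumes Windows with MS-Word and pywin32 installed.

```python
import os


def read_word_text(path, dispatch=None):
    """Open a document in MS-Word via OLE Automation and return its text.

    `dispatch` is a hypothetical hook for this sketch: by default it is
    win32com.client.Dispatch, but a stand-in can be passed for testing.
    """
    if dispatch is None:
        # Imported lazily so this module can still be loaded without pywin32.
        from win32com.client import Dispatch as dispatch
    word = dispatch("Word.Application")
    word.Visible = False  # keep the Word window hidden
    try:
        # Word resolves relative paths against its own working directory,
        # so an absolute path is passed.
        doc = word.Documents.Open(os.path.abspath(path))
        try:
            return doc.Content.Text
        finally:
            doc.Close(False)  # False: close without saving changes
    finally:
        word.Quit()
```

Because Word itself opens the file, this route can work where the DRM software permits Word to open the document, at the cost of starting the Office program.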
5.2. Text file encoding (only UTF-8 is supported)
- Input text files are supported only in UTF-8 encoding.
- If an input text file is saved in ANSI encoding (cp949 on Korean Windows, which is not a Unicode encoding), the following error occurs.
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 5: invalid start byte
- If a similar error occurs during execution, save the text file as UTF-8 and run the tool again.
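If many files need converting, re-saving them can also be scripted. The following is a minimal sketch that assumes the source file really is cp949; the function name convert_cp949_to_utf8 is hypothetical and not part of the tool.

```python
def convert_cp949_to_utf8(src_path, dst_path):
    """Re-save a cp949 (Korean ANSI) text file as UTF-8."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, encoding="cp949", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Writing to a separate destination path keeps the original file intact in case the encoding guess turns out to be wrong.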
5.3. Why use multiprocessing?
The previous post, "Word Extraction Tool(4): Word Extraction Tool Source Code Description(1)", explained the code that uses multiprocessing in the following sections.
4.2.3. Execute get_file_text with multi processing
4.2.4. Execute get_word_list with multi processing
When I first built this tool, both text extraction and word extraction ran in a single process. When it was first used on a project at company K in early 2021, extracting standard word candidates from about 160,000 column comments took about 20 hours on a laptop (i5 CPU, 16 GB RAM).
Because the tool was run repeatedly, including test runs, the execution time had to be shortened. I first looked into threads for parallel processing, but concluded that multiprocessing is a better fit than multithreading: because of Python's GIL (Global Interpreter Lock), only one thread executes Python code at a time, so threads do not speed up CPU-bound work.
With a small code change, multiprocessing was applied to both text extraction and word extraction, and an argument (multi_process_count) was added to specify the degree of parallelism at run time.
With multi_process_count set to 8, the work that had taken 20 hours finished in about 40 minutes, which was a sufficient improvement.
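The general shape of this kind of parallelization can be sketched as follows. This is not the tool's code (that is described in sections 4.2.3 and 4.2.4 of the earlier post); it is a minimal illustration that assumes the input list is split into chunks, one per process. The names split_evenly and run_in_parallel are hypothetical; only multi_process_count comes from the tool.

```python
from multiprocessing import Pool


def split_evenly(items, parts):
    """Split items into at most `parts` contiguous chunks of similar size."""
    parts = max(1, min(parts, len(items)))
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_in_parallel(worker, items, multi_process_count):
    """Apply worker to each chunk and concatenate the per-chunk result lists.

    worker takes a list and returns a list. When a pool is used, worker
    must be a module-level function so that it can be pickled.
    """
    chunks = split_evenly(items, multi_process_count)
    if multi_process_count <= 1:
        # Single-process mode: no pool is created.
        results = [worker(chunk) for chunk in chunks]
    else:
        with Pool(processes=multi_process_count) as pool:
            results = pool.map(worker, chunks)
    return [row for chunk_result in results for row in chunk_result]
```

On Windows, a call such as run_in_parallel(worker, comments, 8) has to sit under an `if __name__ == "__main__":` guard, because each child process re-imports the main module.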
5.4. Notes on compound word extraction
Among the main functions of Mecab, nouns, pos, and morphs work as follows.
Function | Description |
nouns(text) | Parses the text, extracts only the nouns, and returns them as a list |
pos(text) | Parses the text and returns a list of (morpheme, part-of-speech tag) pairs |
morphs(text) | Parses the text, extracts only the morphemes, and returns them as a list |
The execution results of each function for two example inputs are shown below. The inputs are Korean sentences, given here in English translation. In the results, content morphemes are shown as English glosses, and particles and endings with no direct English equivalent are romanized.
* Input string: User defines functional and non-functional requirements.
Function | Execution result |
nouns(text) | ['use', 'function', 'need', 'matter', 'function', 'need', 'matter', 'definition'] |
pos(text) | [('use', 'NNG'), ('-er', 'XSN'), ('neun', 'JX'), ('function', 'NNG'), ('-al', 'XSN'), ('need', 'NNG'), ('matter', 'NNG'), ('and', 'JC'), ('non', 'XPN'), ('function', 'NNG'), ('-al', 'XSN'), ('need', 'NNG'), ('matter', 'NNG'), ('eul', 'JKO'), ('definition', 'NNG'), ('handa', 'XSV+EF'), ('.', 'SF')] |
morphs(text) | ['use', '-er', 'neun', 'function', '-al', 'need', 'matter', 'and', 'non', 'function', '-al', 'need', 'matter', 'eul', 'definition', 'handa', '.'] |
* Input string: Data standardization is an important area of data architecture construction.
Function | Execution result |
nouns(text) | ['data', 'standard', 'data', 'architecture', 'construction', 'important', 'area'] |
pos(text) | [('data', 'NNG'), ('standard', 'NNG'), ('-ization', 'XSN'), ('neun', 'JX'), ('data', 'NNG'), ('architecture', 'NNG'), ('construction', 'NNG'), ('of', 'JKG'), ('important', 'NNG'), ('han', 'XSA+ETM'), ('area', 'NNG'), ('i', 'VCP'), ('da', 'EF'), ('.', 'SF')] |
morphs(text) | ['data', 'standard', '-ization', 'neun', 'data', 'architecture', 'construction', 'of', 'important', 'han', 'area', 'i', 'da', '.'] |
The word extraction tool does not use the nouns function directly. Instead, it extracts words by applying a regular expression to the result of the pos function. The regular expression pattern is described below.
Regular expression pattern: '(NNP/|NNG/)+(XSN/)*|(XPN/)+(NNP/|NNG/)+(XSN/)*|(SL/)+'
- This pattern matches one of three things:
- (NNP/|NNG/)+(XSN/)*: one or more proper or common nouns (required), followed by zero or more noun-derived suffixes (optional)
- (XPN/)+(NNP/|NNG/)+(XSN/)*: one or more noun prefixes (required), followed by one or more proper or common nouns (required), followed by zero or more noun-derived suffixes (optional)
- (SL/)+: one or more foreign-language tokens (required)
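One way such a pattern can be applied to pos output is sketched below. It is an illustration under assumptions, not the tool's implementation: it assumes the tags are joined into a single string with "/" after each tag, and the name extract_candidates is hypothetical. The sample morphemes are English stand-ins for the Korean morphemes of the first example sentence.

```python
import re

# The pattern quoted above, matched against a string of tags joined by "/".
WORD_PATTERN = re.compile(
    r'(NNP/|NNG/)+(XSN/)*|(XPN/)+(NNP/|NNG/)+(XSN/)*|(SL/)+'
)


def extract_candidates(tagged):
    """Return each run of morphemes whose tag sequence matches WORD_PATTERN.

    tagged is a list of (morpheme, tag) pairs in the shape pos() returns.
    """
    tag_string = ''.join(f'{tag}/' for _, tag in tagged)
    candidates = []
    for match in WORD_PATTERN.finditer(tag_string):
        start = match.start()
        # Ignore a match that begins in the middle of a tag.
        if start > 0 and tag_string[start - 1] != '/':
            continue
        first = tag_string[:start].count('/')  # index of first matched token
        length = match.group().count('/')      # number of matched tokens
        candidates.append(tuple(m for m, _ in tagged[first:first + length]))
    return candidates


# Stand-in morphemes for the first example sentence (assumed glosses).
sample = [
    ('use', 'NNG'), ('-er', 'XSN'), ('neun', 'JX'),
    ('function', 'NNG'), ('-al', 'XSN'), ('need', 'NNG'), ('matter', 'NNG'),
    ('and', 'JC'), ('non', 'XPN'), ('function', 'NNG'), ('-al', 'XSN'),
    ('need', 'NNG'), ('matter', 'NNG'), ('eul', 'JKO'),
    ('definition', 'NNG'), ('handa', 'XSV+EF'), ('.', 'SF'),
]
candidates = extract_candidates(sample)
```

Each returned tuple with more than one morpheme is a compound word candidate. In Korean the morphemes of a run are simply concatenated to form the word.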
The following examples compare extracting only nouns with nouns(text) against additionally extracting compound words by applying the regular expression:
* Input string: User defines functional and non-functional requirements.
Method | Execution result |
Extract only nouns with nouns(text) | use, function, need, matter, function, need, matter, definition |
Apply regular expression | use, function, need, matter, function, need, matter, definition, user [compound], functional [compound], requirement [compound], non-functional [compound], requirement [compound] |
* Input string: Data standardization is an important area of data architecture construction.
Method | Execution result |
Extract only nouns with nouns(text) | data, standard, data, architecture, construction, important, area |
Apply regular expression | data, standard, data, architecture, construction, important, area, data standardization [compound], data architecture construction [compound] |
Compound words are extracted in addition so that, in the early stage of building the standard word dictionary, you can review whether to register each compound word as a standard word. This avoids the problems that arise when a compound word is added later.
If a compound word is added later, the physical names of standard terms built from its constituent words may change, and the database table and column names that use those standard terms may have to change as well.
Of course, it is also possible to leave the physical names of existing standard terms unchanged and change only the physical names of standard terms created afterwards. It is recommended from a long-term perspective.
Changing the physical name of a standard term that has already been created is not an easy decision. If development is already in progress, the source code that refers to each renamed column must be changed too. This takes additional time, and on projects involving multiple stakeholders it can raise questions of accountability.
If the amount of source code to change is large, the change can have a significant impact on the project and be quite difficult to carry out.
Although this impact is well known, there has been no practical way to identify compound word candidates in advance. This word extraction tool provides one.
Although it may not be optimal, I think it is sufficient as an alternative at the present time.
5.5. Morpheme analyzer part-of-speech tags
Section 4.4 (get_word_list function) of the previous post, Word Extraction Tool(5): Word Extraction Tool Source Code Description(2), contained the following:
- Line 64: Execute part-of-speech tagging of the morpheme analyzer with the pos function. The details of part-of-speech tagging are covered separately.
- The part-of-speech tagging function pos decomposes the input string into morpheme units and returns a list in which each unit is tagged with its part of speech.
- For example, if the text is 'User defines functional and non-functional requirements.', the execution result of the pos function (shown here as English glosses, with particles and endings romanized) is '[('use', 'NNG'), ('-er', 'XSN'), ('neun', 'JX'), ('function', 'NNG'), ('-al', 'XSN'), ('need', 'NNG'), ('matter', 'NNG'), ('and', 'JC'), ('non', 'XPN'), ('function', 'NNG'), ('-al', 'XSN'), ('need', 'NNG'), ('matter', 'NNG'), ('eul', 'JKO'), ('definition', 'NNG'), ('handa', 'XSV+EF'), ('.', 'SF')]'.
- Among the parts of speech tagged in the example above, 'NNG' is a common noun, 'XSN' is a noun-derived suffix, 'JX' is an auxiliary particle, 'JC' is a conjunctive particle, 'XPN' is a noun prefix, 'JKO' is an objective particle, 'XSV+EF' is a verb-derived suffix plus a sentence-final ending, and 'SF' is a period, question mark, or exclamation mark.
The parts-of-speech tags provided by Mecab are summarized in the following documents.
Excerpts from the above document are summarized below.
Morpheme type | Major category (five word classes + other) | Sejong tag | Sejong description | mecab-ko-dic tag | mecab-ko-dic description |
Lexical morpheme | Substantive | NNG | common noun | NNG | common noun |
 | | NNP | proper noun | NNP | proper noun |
 | | NNB | dependent noun | NNB | dependent noun |
 | | | | NNBC | unit noun |
 | | NR | numeral | NR | numeral |
 | | NP | pronoun | NP | pronoun |
 | Predicate | VV | verb | VV | verb |
 | | VA | adjective | VA | adjective |
 | | VX | auxiliary predicate | VX | auxiliary predicate |
 | | VCP | positive copula | VCP | positive copula |
 | | VCN | negative copula | VCN | negative copula |
 | Modifier | MM | determiner | MM | determiner |
 | | MAG | general adverb | MAG | general adverb |
 | | MAJ | conjunctive adverb | MAJ | conjunctive adverb |
 | Independent word | IC | interjection | IC | interjection |
Grammatical morpheme | Relational word | JKS | nominative particle | JKS | nominative particle |
 | | JKC | complement particle | JKC | complement particle |
 | | JKG | adnominal particle | JKG | adnominal particle |
 | | JKO | objective particle | JKO | objective particle |
 | | JKB | adverbial particle | JKB | adverbial particle |
 | | JKV | vocative particle | JKV | vocative particle |
 | | JKQ | quotative particle | JKQ | quotative particle |
 | | JX | auxiliary particle | JX | auxiliary particle |
 | | JC | conjunctive particle | JC | conjunctive particle |
 | Pre-final ending | EP | pre-final ending | EP | pre-final ending |
 | Final ending | EF | sentence-final ending | EF | sentence-final ending |
 | | EC | connective ending | EC | connective ending |
 | | ETN | nominalizing ending | ETN | nominalizing ending |
 | | ETM | adnominalizing ending | ETM | adnominalizing ending |
 | Prefix | XPN | noun prefix | XPN | noun prefix |
 | Suffix | XSN | noun-derived suffix | XSN | noun-derived suffix |
 | | XSV | verb-derived suffix | XSV | verb-derived suffix |
 | | XSA | adjective-derived suffix | XSA | adjective-derived suffix |
 | Root | XR | root | XR | root |
 | Symbol | SF | period, question mark, exclamation mark | SF | period, question mark, exclamation mark |
 | | SE | ellipsis | SE | ellipsis … |
 | | SS | quotation marks, parentheses, dash | SSO | opening bracket (, [ |
 | | SS | quotation marks, parentheses, dash | SSC | closing bracket ), ] |
 | | SP | comma, middle dot, colon, slash | SC | separator , · / : |
 | | SO | hyphen (tilde, hidden mark, omission mark) | SY | |
 | | SW | other symbols (logical and mathematical symbols, currency symbols) | SY | |
 | Non-Korean | SL | foreign language | SL | foreign language |
 | | SH | Chinese character | SH | Chinese character |
 | | SN | number | SN | number |
This concludes the article on the word extraction tool. If features are added or improved, a separate article will be written.
<< List of related articles >>
- Word Extraction Tool(1): Overview of Word Extraction Tool
- Word Extraction Tool (2): Configure the Word Extraction Tool Execution Environment
- Word Extraction Tool (3): How to Run the Word Extraction Tool and Check the Results
- Word Extraction Tool(4): Word Extraction Tool Source Code Description(1)
- Word Extraction Tool(5): Word Extraction Tool Source Code Description(2)
- Word Extraction Tool (6): Additional Description of Word Extraction Tool
- Full Contents of Word Extraction Tool Description, Download