Word Extraction Tool (6): Additional Description of Word Extraction Tool

Continuing from the previous article, we will look at the additional explanation of the word extraction tool.

Word Extraction Tool(5): Word Extraction Tool Source Code Description(2)

5. Additional explanation of word extraction tool

5.1. Why use OLE Automation?

Wikipedia defines OLE Automation as follows.

In Microsoft Windows application programming, OLE Automation (later renamed simply Automation[1][2]) is an interprocess communication (IPC) mechanism developed by Microsoft. It is based on a subset of the Component Object Model (COM) and was intended to be used through a scripting language (originally Visual Basic), but is now available through several languages on Windows.

source: https://en.wikipedia.org/wiki/OLE_Automation

In Python, OLE Automation is possible using the win32com package. You can control MS-Office applications to perform desired functions.

OLE Automation using Python win32com package

OLE Automation was used for the word extraction tool for the following reasons.

  • Dedicated packages exist that can read and write MS-Word, PowerPoint, and Excel files, but they were intentionally not used.
    • MS-Word: python-docx, python-docx2txt
    • PowerPoint: python-pptx
    • Excel: openpyxl, xlsxwriter, pyxlsb
  • Most corporate environments enforce the installation of DRM software, so document files are encrypted.
  • A dedicated package cannot read an encrypted file.
  • With OLE Automation through the pywin32 package, the file is opened by the Office program itself, so it can be read.
  • OLE Automation costs some performance, but it reliably produces results.
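As a rough illustration of the approach described above, the following sketch reads the text of a Word document by driving MS Word through win32com. It is not the tool's actual code: the function name read_word_text is hypothetical, and the sketch assumes Windows with MS Word and the pywin32 package installed.

```python
import os


def read_word_text(path):
    """Return the plain text of a Word document by driving MS Word itself.

    Hypothetical helper for illustration; requires Windows, MS Word, and pywin32.
    """
    # Imported lazily because pywin32 is only available on Windows.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly
        doc = word.Documents.Open(os.path.abspath(path), False, True)
        try:
            return doc.Content.Text
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Because Word opens the file, any DRM handling that works for the user in Word also applies here, at the cost of starting an Office process.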

For the Python code that controls MS-Word, PowerPoint, and Excel, please refer to the following article.

5.2. Text file encoding (only UTF-8 is supported)

  • Input text files are supported only in UTF-8 encoding.
  • If an input text file is saved with ANSI encoding (cp949 on Korean Windows), which is not a Unicode encoding, the following error occurs.
    • UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 5: invalid start byte
  • If a similar error occurs during execution, save the text file as UTF-8 and run the tool again.
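If many files are affected, re-saving them by hand is tedious. The sketch below shows one possible way to detect a non-UTF-8 file and convert it; the function names are hypothetical, and the source encoding is assumed to be cp949, which should be confirmed for each file.

```python
from pathlib import Path


def read_utf8_or_explain(path):
    """Read a text file as UTF-8, raising a clearer error if it is not UTF-8."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"{path} is not UTF-8; save it as UTF-8 and re-run") from exc


def convert_to_utf8(src, dst, source_encoding="cp949"):
    """Re-encode a text file to UTF-8 (source_encoding is an assumption)."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```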

5.3. Why use multiprocessing?

The previous post, "Word Extraction Tool(4): Word Extraction Tool Source Code Description(1)", explained the code that uses multiprocessing in the following sections.

4.2.3. Execute get_file_text with multi processing

4.2.4. Execute get_word_list with multi processing

When I first built this tool, both text extraction and word extraction ran in a single process. When it was first used on a project for company K in early 2021, extracting standard word candidates from about 160,000 column comments took about 20 hours on a laptop (i5 CPU, 16 GB RAM).

Because the tool had to be run repeatedly, including test runs, the execution time needed to be shortened. I first looked into parallelizing with threads, but concluded that multiprocessing is more suitable than multithreading: Python's GIL (Global Interpreter Lock) allows only one thread to execute Python bytecode at a time, so threads do not speed up CPU-bound work.

With a small code change, multiprocessing was applied to both text extraction and word extraction, and an argument (multi_process_count) was added to specify the degree of parallelism at run time.

With multi_process_count set to 8, the job that had taken 20 hours finished in about 40 minutes, which was a sufficient improvement.
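The general idea can be sketched as follows: split the input into as many chunks as there are processes and hand each chunk to a worker. This is a minimal illustration, not the tool's code; split_into_chunks, count_words, and run_parallel are hypothetical names, and count_words merely stands in for the real morphological analysis.

```python
from multiprocessing import Pool


def split_into_chunks(items, chunk_count):
    """Split items into at most chunk_count contiguous, roughly equal chunks."""
    chunk_count = max(1, min(chunk_count, len(items)))
    size, extra = divmod(len(items), chunk_count)
    chunks, start = [], 0
    for i in range(chunk_count):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(texts):
    """Stand-in worker: the real tool would run morphological analysis here."""
    return sum(len(text.split()) for text in texts)


def run_parallel(texts, multi_process_count):
    """Process the chunks in separate processes and return one result per chunk."""
    chunks = split_into_chunks(texts, multi_process_count)
    with Pool(processes=len(chunks)) as pool:
        return pool.map(count_words, chunks)
```

When run as a script, run_parallel should be called under an `if __name__ == "__main__":` guard, which is required on Windows because worker processes re-import the main module.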


5.4. Notes on compound word extraction

The explanation of nouns, pos, and morphs among the main functions of Mecab is as follows.

| Function | Description |
| --- | --- |
| nouns(text) | Parses the text, extracts only the nouns, and returns them as a list |
| pos(text) | Parses the text and returns a list of (morpheme, part-of-speech tag) pairs |
| morphs(text) | Parses the text, extracts only the morphemes, and returns them as a list |

The results of each function for two example input strings are shown below. The tokens are English glosses of the Korean morphemes; grammatical morphemes are described in parentheses.

* Input string: User defines functional and non-functional requirements.

| Function | Execution result |
| --- | --- |
| nouns(text) | ['use', 'function', 'request', 'matter', 'function', 'request', 'matter', 'definition'] |
| pos(text) | [('use', 'NNG'), ('-er', 'XSN'), ('(topic particle)', 'JX'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('and', 'JC'), ('non-', 'XPN'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('(object particle)', 'JKO'), ('definition', 'NNG'), ('does', 'XSV+EF'), ('.', 'SF')] |
| morphs(text) | ['use', '-er', '(topic particle)', 'function', '-al', 'request', 'matter', 'and', 'non-', 'function', '-al', 'request', 'matter', '(object particle)', 'definition', 'does', '.'] |

* Input string: Data standardization is an important area of data architecture construction.

| Function | Execution result |
| --- | --- |
| nouns(text) | ['data', 'standard', 'data', 'architecture', 'construction', 'important', 'area'] |
| pos(text) | [('data', 'NNG'), ('standard', 'NNG'), ('-ization', 'XSN'), ('(topic particle)', 'JX'), ('data', 'NNG'), ('architecture', 'NNG'), ('construction', 'NNG'), ('of', 'JKG'), ('important', 'NNG'), ('(adjectival suffix)', 'XSA+ETM'), ('area', 'NNG'), ('be', 'VCP'), ('(declarative ending)', 'EF'), ('.', 'SF')] |
| morphs(text) | ['data', 'standard', '-ization', '(topic particle)', 'data', 'architecture', 'construction', 'of', 'important', '(adjectival suffix)', 'area', 'be', '(declarative ending)', '.'] |

The word extraction tool does not use the nouns function directly, but extracts words by applying a regular expression to the result of the pos function. Below is a description of the regular expression patterns.

Pattern using regular expression: '(NNP/|NNG/)+(XSN/)*|(XPN/)+(NNP/|NNG/)+(XSN/)*|(SL/)+'

  • This pattern matches one of three tag sequences:
    • (NNP/|NNG/)+(XSN/)*: one or more proper or common nouns (required) + zero or more noun-deriving suffixes (optional)
    • (XPN/)+(NNP/|NNG/)+(XSN/)*: one or more noun prefixes (required) + one or more proper or common nouns (required) + zero or more noun-deriving suffixes (optional)
    • (SL/)+: one or more foreign-language tokens (required)
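To make the pattern concrete, here is a minimal sketch of one way it could be applied to a pos result. It assumes the tags are joined into a single string with a trailing "/" after each tag, as the pattern suggests; the function name extract_candidates and the offset-mapping logic are illustrative assumptions, not the tool's actual implementation. The sample tokens are English stand-ins for Korean morphemes.

```python
import re

PATTERN = re.compile(r"(NNP/|NNG/)+(XSN/)*|(XPN/)+(NNP/|NNG/)+(XSN/)*|(SL/)+")


def extract_candidates(tagged):
    """Return the morpheme groups whose tag sequence matches PATTERN.

    tagged: list of (morpheme, tag) pairs, like the output of pos().
    """
    tag_string = "".join(tag + "/" for _, tag in tagged)

    # Map the character offset where each tag starts to its token index.
    starts = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        starts[offset] = index
        offset += len(tag) + 1
    starts[offset] = len(tagged)  # sentinel for a match ending at the last token

    groups = []
    for match in PATTERN.finditer(tag_string):
        # Skip matches that do not line up with token boundaries.
        if match.start() not in starts or match.end() not in starts:
            continue
        first, last = starts[match.start()], starts[match.end()]
        groups.append([morpheme for morpheme, _ in tagged[first:last]])
    return groups


sample = [
    ("use", "NNG"), ("-er", "XSN"), ("(topic)", "JX"),
    ("non-", "XPN"), ("function", "NNG"), ("-al", "XSN"),
    ("request", "NNG"), ("matter", "NNG"), ("(object)", "JKO"),
    ("definition", "NNG"), ("does", "XSV+EF"), (".", "SF"),
]
candidates = extract_candidates(sample)
```

A group with more than one morpheme, such as the prefix, noun, and suffix forming "non-functional", is a compound word candidate; single-morpheme groups are plain nouns.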

The following examples compare extracting only nouns with nouns(text) against additionally extracting compound words by applying the regular expression.

* Input string: User defines functional and non-functional requirements.

| Method | Execution result |
| --- | --- |
| Extract only nouns: nouns(text) | use, function, request, matter, function, request, matter, definition |
| Apply regular expression | use, function, request, matter, function, request, matter, definition, user [compound], functional [compound], requirement [compound], non-functional [compound], requirement [compound] |

* Input string: Data standardization is an important area of data architecture construction.

| Method | Execution result |
| --- | --- |
| Extract only nouns: nouns(text) | data, standard, data, architecture, construction, important, area |
| Apply regular expression | data, standard, data, architecture, construction, important, area, data standardization [compound], data architecture construction [compound] |

Compound words are extracted additionally so that, in the initial stage of building the standard word dictionary, you can review whether to register each compound word as a standard word. This prevents the problems that arise when a compound word is added later.

If a compound word is added later, the physical names of standard terms built from its constituent words may have to change, and even the database table and column names that use those standard terms may need to be changed.

Of course, there is a way to change only the physical name of the standard term to be created without changing the physical name of the standard term that has already been created. It is recommended from a long-term perspective.

Changing the physical name of a standard term that has already been created is not an easy decision. If development is already in progress, the source code that refers to the renamed columns must also be changed. When the change requires additional schedule on a project with multiple stakeholders, questions of responsibility may arise.

If the amount of source code to change is large, the change can have a significant impact on the project and can be quite difficult to carry out.

Although this impact was well understood, there was previously no suitable way to identify compound word candidates. This word extraction tool provides one.

It may not be optimal, but I think it is a sufficient alternative for now.

5.5. Morpheme analyzer part-of-speech tags

Section 4.4 (the _get_word_list function) of the previous post, "Word Extraction Tool(5): Word Extraction Tool Source Code Description(2)", contained the following:

  • Line 64: Run the morpheme analyzer's part-of-speech tagging with the pos function. Part-of-speech tagging is described separately.
    • The part-of-speech tagging function pos splits the input string into part-of-speech units and returns a list in which each unit is tagged.
    • For example, if the text is 'User defines functional and non-functional requirements.', the result of the pos function is (shown with English glosses of the Korean morphemes) [('use', 'NNG'), ('-er', 'XSN'), ('(topic particle)', 'JX'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('and', 'JC'), ('non-', 'XPN'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('(object particle)', 'JKO'), ('definition', 'NNG'), ('does', 'XSV+EF'), ('.', 'SF')].
    • Among the tags in the example above, 'NNG' is a common noun, 'XSN' is a noun-deriving suffix, 'JX' is an auxiliary particle, 'JC' is a conjunctive particle, 'XPN' is a noun prefix, 'JKO' is an object case particle, 'XSV+EF' is a verb-deriving suffix plus a final ending, and 'SF' is a period, question mark, or exclamation mark.

The part-of-speech tags provided by Mecab are summarized in the following document.

https://docs.google.com/spreadsheets/d/1-9blXKjtjeKZqsf4NzHeYJCrr49-nXeRF6D80udfcwY/edit#gid=589544265

Excerpts from the above document are summarized below.

| Lexical meaning | Major category (five word classes + other) | Sejong tag | Sejong description | mecab-ko-dic tag | mecab-ko-dic description |
| --- | --- | --- | --- | --- | --- |
| Content morpheme | Substantive | NNG | common noun | NNG | common noun |
| | | NNP | proper noun | NNP | proper noun |
| | | NNB | dependent noun | NNB | dependent noun |
| | | | | NNBC | unit noun |
| | | NR | numeral | NR | numeral |
| | | NP | pronoun | NP | pronoun |
| | Predicate | VV | verb | VV | verb |
| | | VA | adjective | VA | adjective |
| | | VX | auxiliary predicate | VX | auxiliary predicate |
| | | VCP | positive copula | VCP | positive copula |
| | | VCN | negative copula | VCN | negative copula |
| | Modifier | MM | determiner | MM | determiner |
| | | MAG | general adverb | MAG | general adverb |
| | | MAJ | conjunctive adverb | MAJ | conjunctive adverb |
| | Independent word | IC | interjection | IC | interjection |
| Grammatical morpheme | Relational word (particle) | JKS | nominative case particle | JKS | nominative case particle |
| | | JKC | complement case particle | JKC | complement case particle |
| | | JKG | adnominal case particle | JKG | adnominal case particle |
| | | JKO | object case particle | JKO | object case particle |
| | | JKB | adverbial case particle | JKB | adverbial case particle |
| | | JKV | vocative case particle | JKV | vocative case particle |
| | | JKQ | quotative case particle | JKQ | quotative case particle |
| | | JX | auxiliary particle | JX | auxiliary particle |
| | | JC | conjunctive particle | JC | conjunctive particle |
| | Pre-final ending | EP | pre-final ending | EP | pre-final ending |
| | Ending | EF | final ending | EF | final ending |
| | | EC | connective ending | EC | connective ending |
| | | ETN | nominalizing ending | ETN | nominalizing ending |
| | | ETM | adnominalizing ending | ETM | adnominalizing ending |
| | Prefix | XPN | noun prefix | XPN | noun prefix |
| | Suffix | XSN | noun-deriving suffix | XSN | noun-deriving suffix |
| | | XSV | verb-deriving suffix | XSV | verb-deriving suffix |
| | | XSA | adjective-deriving suffix | XSA | adjective-deriving suffix |
| | Root | XR | root | XR | root |
| | Symbol | SF | period, question mark, exclamation mark | SF | period, question mark, exclamation mark |
| | | SE | ellipsis | SE | ellipsis |
| | | SS | quotation marks, parentheses, dash | SSO | opening bracket (, [ |
| | | | | SSC | closing bracket ), ] |
| | | SP | comma, middle dot, colon, slash | SC | separator , · / : |
| | | SO | hyphen (tilde, hidden, omitted) | SY | |
| | | SW | other symbols (logical and math symbols, currency symbols) | SY | |
| | Non-Korean | SL | foreign language | SL | foreign language |
| | | SH | Chinese characters | SH | Chinese characters |
| | | SN | number | SN | number |

This concludes the article on the word extraction tool. If features are added or improved, a separate article will be written.

