
Word Extraction Tool (6): Additional Description of Word Extraction Tool

Published September 25, 2022 · Updated October 10, 2022

Continuing from the previous article, this post gives additional explanation of the word extraction tool.

Word Extraction Tool(5): Word Extraction Tool Source Code Description(2)


5. Additional explanation of word extraction tool

5.1. Why use OLE Automation?

Wikipedia defines OLE Automation as follows.

In Microsoft Windows application programming, OLE Automation (later renamed simply Automation[1][2]) is an interprocess communication (IPC) mechanism developed by Microsoft. It is based on a subset of the Component Object Model (COM) and was intended to be used through a scripting language (originally Visual Basic), but is now available through several languages on Windows.

source: https://en.wikipedia.org/wiki/OLE_Automation

In Python, OLE Automation is available through the win32com module of the pywin32 package. With it, you can control MS-Office applications to perform the functions you need.

OLE Automation using Python win32com package

OLE Automation was used for the word extraction tool for the following reasons.

There are dedicated packages that can read and write MS-Word, PowerPoint, and Excel files, but they were intentionally not used.
- MS-Word: python-docx, python-docx2txt
- PowerPoint: python-pptx
- Excel: openpyxl, xlsxwriter, pyxlsb
Most corporate environments enforce the installation of DRM software, so document files are stored encrypted.
- A dedicated package cannot read these encrypted files.
- With OLE Automation through the pywin32 package, the file is read through the Office program itself.
- OLE Automation may cost some performance, but the text can be read reliably.
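
As a rough illustration of the approach described above, the sketch below opens a document through MS-Word itself and returns its text. It is not the tool's actual get_doc_text function. The function name read_word_text and the dispatch parameter (added only so the code can be exercised without Office installed) are assumptions made for this example. Running it against a real file requires Windows, MS-Word, and the pywin32 package.

```python
import os


def read_word_text(path, dispatch=None):
    """Return the plain text of a Word document by driving MS-Word via OLE Automation.

    `dispatch` is a hypothetical hook for supplying a COM dispatcher; by default
    win32com.client.Dispatch from the pywin32 package is used.
    """
    if dispatch is None:
        # Imported lazily because pywin32 is only available on Windows.
        from win32com.client import Dispatch as dispatch

    word = dispatch("Word.Application")
    word.Visible = False
    try:
        # Positional arguments: FileName, ConfirmConversions=False, ReadOnly=True.
        document = word.Documents.Open(os.path.abspath(path), False, True)
        try:
            return document.Content.Text
        finally:
            document.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Because the file is opened by Word rather than parsed directly, any document that Word can open on that machine can in principle be read this way, at the cost of starting an Office process for each call.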

For the Python code that controls MS-Word, PowerPoint, and Excel, please refer to the following sections of the earlier article.

MS-Word Automation: 4.3.1. get_doc_text function
PowerPoint Automation: 4.3.2. get_ppt_text function
Excel Automation: 4.3.4. get_db_comment_text function

5.2. Text file encoding (only UTF-8 is supported)

The tool reads input text files as UTF-8 only.
If an input text file is saved with ANSI encoding (cp949 on Korean Windows, which is not a Unicode encoding), an error like the following occurs.
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 5: invalid start byte
If a similar error occurs during execution, save the text file as UTF-8 and run the tool again.
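
If many files are affected, re-saving them by hand is tedious. The sketch below checks whether a file decodes as UTF-8 and re-encodes a cp949 file to UTF-8. The helper names is_utf8 and convert_to_utf8 are made up for this example and are not part of the tool. It also assumes the source file really is cp949.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding="cp949"):
    """Re-encode a text file from `source_encoding` (assumed cp949) to UTF-8."""
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Working on bytes rather than text mode keeps the original line endings unchanged.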

5.3. Why use multiprocessing?

The previous post, "Word Extraction Tool(4): Word Extraction Tool Source Code Description(1)", explained the code that uses multiprocessing.

4.2.3. Execute get_file_text with multi processing

4.2.4. Execute get_word_list with multi processing

When I first made this tool, both text extraction and word extraction ran in a single process. When it was first used on a project at company K in early 2021, extracting standard word candidates from about 160,000 column comments took about 20 hours on a laptop (i5 CPU, 16GB RAM).

Because the tool had to be run several times, including test runs, the execution time needed to be shortened. I first looked into using threads for parallel processing, but because Python has the GIL (Global Interpreter Lock), which prevents threads from executing Python bytecode in parallel, I concluded that multiprocessing was more suitable than multithreading.

With a small code change, multiprocessing was applied to both text extraction and word extraction, and an argument (multi_process_count) was added to specify the degree of parallelism at run time.

With multi_process_count set to 8, the job that had taken 20 hours finished in about 40 minutes, which was a sufficient improvement.
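
The idea can be sketched with the standard multiprocessing.Pool as follows. This is a minimal illustration under assumptions, not the tool's code. The function name run_in_parallel is hypothetical, and only the multi_process_count argument name is taken from the article.

```python
from multiprocessing import Pool


def run_in_parallel(func, items, multi_process_count=1):
    """Apply func to every item, using up to multi_process_count worker processes."""
    items = list(items)
    if multi_process_count <= 1 or len(items) <= 1:
        # Single-process fallback: no worker processes are started.
        return [func(item) for item in items]
    with Pool(processes=multi_process_count) as pool:
        # Pool.map returns the results in the same order as the input items.
        return pool.map(func, items)
```

When worker processes are used, func must be a picklable, module-level function. On Windows the call should also sit under an `if __name__ == "__main__":` guard, because each worker process re-imports the main module.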

5.4. Notes on compound word extraction

Among the main functions of Mecab, nouns, pos, and morphs work as follows.

Function	Description
nouns(text)	Parses the text and returns a list containing only the nouns
pos(text)	Parses the text and returns a list of (morpheme, part-of-speech tag) tuples
morphs(text)	Parses the text and returns a list containing only the morphemes

The examples below show the result of each function for two input strings. The inputs are Korean sentences, so the morphemes are shown here as rough English glosses; particles and endings have no exact English equivalent.

* Input string: User defines functional and non-functional requirements.

Function	Execution result
nouns(text)	['use', 'function', 'request', 'matter', 'function', 'request', 'matter', 'definition']
pos(text)	[('use', 'NNG'), ('-er', 'XSN'), ('is', 'JX'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('and', 'JC'), ('non', 'XPN'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('to', 'JKO'), ('definition', 'NNG'), ('should', 'XSV+EF'), ('.', 'SF')]
morphs(text)	['use', '-er', 'is', 'function', '-al', 'request', 'matter', 'and', 'non', 'function', '-al', 'request', 'matter', 'to', 'definition', 'should', '.']

* Input string: Data standardization is an important area of data architecture construction.

Function	Execution result
nouns(text)	['data', 'standard', 'data', 'architecture', 'construction', 'important', 'area']
pos(text)	[('data', 'NNG'), ('standard', 'NNG'), ('-ization', 'XSN'), ('is', 'JX'), ('data', 'NNG'), ('architecture', 'NNG'), ('construction', 'NNG'), ('of', 'JKG'), ('important', 'NNG'), ('one', 'XSA+ETM'), ('area', 'NNG'), ('this', 'VCP'), ('da', 'EF'), ('.', 'SF')]
morphs(text)	['data', 'standard', '-ization', 'is', 'data', 'architecture', 'construction', 'of', 'important', 'one', 'area', 'this', 'da', '.']

The word extraction tool does not use the nouns function directly, but extracts words by applying a regular expression to the result of the pos function. Below is a description of the regular expression patterns.

Regular expression pattern: '(NNP/|NNG/)+(XSN/)*|(XPN/)+(NNP/|NNG/)+(XSN/)*|(SL/)+'

This pattern matches one of three things:
- (NNP/|NNG/)+(XSN/)*: one or more proper or common nouns (required), followed by zero or more noun-derived suffixes (optional)
- (XPN/)+(NNP/|NNG/)+(XSN/)*: one or more noun prefixes (required), followed by one or more proper or common nouns (required), followed by zero or more noun-derived suffixes (optional)
- (SL/)+: one or more foreign-language tokens (required)
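
To make the pattern concrete, here is a minimal sketch of one way it could be applied to the output of pos. It assumes the tags are joined into a single string with "/" after each tag, which is what the pattern's shape suggests. The tool's actual implementation may differ, and the function name extract_candidates and its sep parameter are hypothetical.

```python
import re

# Pattern quoted in the article; each tag is followed by "/".
WORD_PATTERN = re.compile(r"(NNP/|NNG/)+(XSN/)*|(XPN/)+(NNP/|NNG/)+(XSN/)*|(SL/)+")


def extract_candidates(tagged, sep=""):
    """Return (text, morpheme_count) for each run of tags matching WORD_PATTERN.

    `tagged` is a list of (morpheme, tag) tuples, the shape returned by pos().
    """
    tag_string = "".join(tag + "/" for _, tag in tagged)
    candidates = []
    for match in WORD_PATTERN.finditer(tag_string):
        start = match.start()
        # Skip matches that begin in the middle of a combined tag such as "VV+NNG".
        if start > 0 and tag_string[start - 1] != "/":
            continue
        first = tag_string.count("/", 0, start)  # index of the first matched token
        length = match.group().count("/")        # number of matched tokens
        morphemes = [morpheme for morpheme, _ in tagged[first:first + length]]
        candidates.append((sep.join(morphemes), length))
    return candidates
```

A result with a morpheme count of 1 is a plain noun, and a count of 2 or more marks a compound word candidate. For example, with English placeholder morphemes, the tagged sequence ('non', 'XPN'), ('function', 'NNG'), ('-al', 'XSN') is returned as a single three-morpheme candidate.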

The following examples compare extracting only nouns with nouns(text) against additionally extracting compound words by applying the regular expression:

* Input string: User defines functional and non-functional requirements.

Function	Execution result
Nouns only: nouns(text)	use, function, request, matter, function, request, matter, definition
Regular expression applied	use, function, request, matter, function, request, matter, definition, user [compound], functional [compound], requirement [compound], non-functional [compound], requirement [compound]

* Input string: Data standardization is an important area of data architecture construction.

Function	Execution result
Nouns only: nouns(text)	data, standard, data, architecture, construction, important, area
Regular expression applied	data, standard, data, architecture, construction, important, area, data standardization [compound], data architecture construction [compound]

Compound words are extracted in addition so that, early in building the standard word dictionary, you can review whether each compound word should be registered as a standard word. This prevents the problems that arise when a compound word is added later.

If a compound word is added later, the physical names of standard terms built from the individual words that make up the compound may have to change, and even the database table names and column names that use those standard terms may need to change.

Of course, there is a way to change only the physical name of the standard term to be created without changing the physical name of the standard term that has already been created. It is recommended from a long-term perspective.

Changing the physical name of a standard term that has already been created is not an easy decision. If development is already in progress, the source code that refers to the renamed columns must be changed as well. This takes additional time, and in projects involving multiple stakeholders it can raise questions of responsibility.

If the amount of source code to be changed is large, the change can have a significant impact on the project and can be quite difficult.

Although this impact was well understood, there was previously no way to identify compound word candidates. This word extraction tool provides a suitable method.

Although it may not be optimal, I think it is sufficient as an alternative at the present time.

5.5. Part-of-speech tags of the morpheme analyzer

The previous post, Word Extraction Tool(5): Word Extraction Tool Source Code Description(2), contained the following in section 4.4. get_word_list function:

Line 64: Runs the morpheme analyzer's part-of-speech tagging with the pos function. Part-of-speech tagging is covered separately.
- The part-of-speech tagging function pos splits the input string into morphemes and returns a list in which each morpheme is tagged with its part of speech.
- For example, if the text is 'Users define functional and non-functional requirements', the execution result of the pos function is '[('use', 'NNG'), ('-er', 'XSN'), ('is', 'JX'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('and', 'JC'), ('non', 'XPN'), ('function', 'NNG'), ('-al', 'XSN'), ('request', 'NNG'), ('matter', 'NNG'), ('to', 'JKO'), ('definition', 'NNG'), ('should', 'XSV+EF'), ('.', 'SF')]'.
- Among the tags in the example above, 'NNG' is a common noun, 'XSN' is a noun-derived suffix, 'JX' is an auxiliary particle, 'JC' is a connective particle, 'XPN' is a prefix, 'JKO' is an objective particle, 'XSV+EF' is a verb-derived suffix plus a final ending, and 'SF' is a period, question mark, or exclamation mark.
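
A combined tag such as 'XSV+EF' can be read by splitting it on '+' and looking up each part. The sketch below does this for the tags mentioned in the example above. The names TAG_DESCRIPTIONS and describe_tag are hypothetical, and the dictionary covers only those tags.

```python
# Descriptions for the tags that appear in the example above (not the full tag set).
TAG_DESCRIPTIONS = {
    "NNG": "common noun",
    "XSN": "noun-derived suffix",
    "JX": "auxiliary particle",
    "JC": "connective particle",
    "XPN": "prefix",
    "JKO": "objective particle",
    "XSV": "verb-derived suffix",
    "EF": "final ending",
    "SF": "period, question mark, or exclamation mark",
}


def describe_tag(tag):
    """Describe a tag, splitting combined tags such as 'XSV+EF' on '+'."""
    parts = tag.split("+")
    return " + ".join(TAG_DESCRIPTIONS.get(part, "unknown tag") for part in parts)
```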

The part-of-speech tags provided by Mecab are summarized in the following document.

https://docs.google.com/spreadsheets/d/1-9blXKjtjeKZqsf4NzHeYJCrr49-nXeRF6D80udfcwY/edit#gid=589544265

Excerpts from the above document are summarized below.

Morpheme type	Category (five word classes + other)	Sejong tag	Sejong description	mecab-ko-dic tag	mecab-ko-dic description
Content morpheme	Substantive	NNG	common noun	NNG	common noun
Content morpheme	Substantive	NNP	proper noun	NNP	proper noun
Content morpheme	Substantive	NNB	dependent noun	NNB	dependent noun
Content morpheme	Substantive			NNBC	unit noun
Content morpheme	Substantive	NR	numeral	NR	numeral
Content morpheme	Substantive	NP	pronoun	NP	pronoun
Content morpheme	Predicate	VV	verb	VV	verb
Content morpheme	Predicate	VA	adjective	VA	adjective
Content morpheme	Predicate	VX	auxiliary verb	VX	auxiliary verb
Content morpheme	Predicate	VCP	affirmative designator	VCP	affirmative designator
Content morpheme	Predicate	VCN	negative designator	VCN	negative designator
Content morpheme	Modifier	MM	determiner	MM	determiner
Content morpheme	Modifier	MAG	common adverb	MAG	common adverb
Content morpheme	Modifier	MAJ	conjunctive adverb	MAJ	conjunctive adverb
Content morpheme	Independent word	IC	interjection	IC	interjection
Functional morpheme	Relational word	JKS	nominative case particle	JKS	nominative case particle
Functional morpheme	Relational word	JKC	complement case particle	JKC	complement case particle
Functional morpheme	Relational word	JKG	adnominal case particle	JKG	adnominal case particle
Functional morpheme	Relational word	JKO	objective case particle	JKO	objective case particle
Functional morpheme	Relational word	JKB	adverbial case particle	JKB	adverbial case particle
Functional morpheme	Relational word	JKV	vocative case particle	JKV	vocative case particle
Functional morpheme	Relational word	JKQ	quotative case particle	JKQ	quotative case particle
Functional morpheme	Relational word	JX	auxiliary particle	JX	auxiliary particle
Functional morpheme	Relational word	JC	connective particle	JC	connective particle
Functional morpheme	Pre-final ending	EP	pre-final ending	EP	pre-final ending
Functional morpheme	Final ending	EF	sentence-final ending	EF	sentence-final ending
Functional morpheme	Final ending	EC	connective ending	EC	connective ending
Functional morpheme	Final ending	ETN	noun-form ending	ETN	noun-form ending
Functional morpheme	Final ending	ETM	adnominal-form ending	ETM	adnominal-form ending
Functional morpheme	Prefix	XPN	noun prefix	XPN	noun prefix
Functional morpheme	Suffix	XSN	noun-derived suffix	XSN	noun-derived suffix
Functional morpheme	Suffix	XSV	verb-derived suffix	XSV	verb-derived suffix
Functional morpheme	Suffix	XSA	adjective-derived suffix	XSA	adjective-derived suffix
	Root	XR	root	XR	root
	Symbol	SF	period, question mark, exclamation mark	SF	period, question mark, exclamation mark
	Symbol	SE	ellipsis	SE	ellipsis …
	Symbol	SS	quotation marks, parentheses, dashes	SSO	opening bracket (, [
	Symbol			SSC	closing bracket ), ]
	Symbol	SP	comma, middle dot, colon, slash	SC	separator , · / :
	Symbol	SO	hyphen (tilde, hidden, omitted)	SY	
	Symbol	SW	other symbols (logical and mathematical symbols, currency symbols)	SY	
	Non-Korean	SL	foreign language	SL	foreign language
	Non-Korean	SH	Chinese character	SH	Chinese character
	Non-Korean	SN	number	SN	number

This concludes the article on the word extraction tool. If features are added or improved, a separate article will be written.


Tags: python MeCab word extraction word-extractor stem analyzer natural language processing
