Computational linguistics. Theoretical and computer lexicography

Computational linguistics: methods, resources, applications

Introduction

The term computational linguistics (CL) has become increasingly common in recent years in connection with the development of various applied software systems, including commercial software products. This is due to the rapid growth of textual information in society, including on the Internet, and the need for automatic processing of texts in natural language (NL). This circumstance stimulates the development of computational linguistics as a field of science and the development of new information and linguistic technologies.

Within the framework of computational linguistics, which has existed for more than 50 years (and is also known as machine linguistics, or automatic text processing in NL), many promising methods and ideas have been proposed, but not all of them have yet found their expression in software products used in practice. Our goal is to characterize the specifics of this area of research, formulate its main tasks, indicate its connections with other sciences, give a brief overview of the main approaches and resources used, and also briefly characterize existing applications of CL. For a more detailed introduction to these issues, we can recommend the books listed in the literature section below.

1. Problems of computational linguistics

Computational linguistics arose at the intersection of such sciences as linguistics, mathematics, computer science, and artificial intelligence. The origins of CL go back to the research of the famous American scientist N. Chomsky on formalizing the structure of natural language; its development also draws on results in general linguistics. Linguistics studies the general laws of natural language – its structure and functioning – and includes the following areas:

Ø Phonology – studies speech sounds and the rules for combining them in the formation of speech;

Ø Morphology – deals with the internal structure and external form of words, including parts of speech and their categories;

Ø Syntax – studies the structure of sentences, the rules of compatibility and order of words in a sentence, as well as its general properties as a unit of language;

Ø Semantics and pragmatics – closely related areas: semantics deals with the meaning of words, sentences, and other units of speech, while pragmatics deals with the ways this meaning is expressed in connection with the specific goals of communication;

Ø Lexicography describes the lexicon of a particular NL - its individual words and their grammatical properties, as well as methods for creating dictionaries.

The results of N. Chomsky, obtained at the intersection of linguistics and mathematics, laid the foundation for the theory of formal languages and grammars (often called generative grammars). This theory now belongs to mathematical linguistics and is used for processing not so much NL as artificial languages, primarily programming languages. By its nature it is a fully mathematical discipline.

Mathematical linguistics also includes quantitative linguistics, which studies the frequency characteristics of language - words, their combinations, syntactic structures, etc., and uses mathematical methods of statistics, so this branch of science can be called statistical linguistics.

CL is also closely related to such an interdisciplinary scientific field as artificial intelligence (AI), within which computer models of individual intellectual functions are developed. One of the first working programs in the field of AI and CL is the famous program of T. Winograd, which understood the simplest human commands to change a world of blocks, formulated in a limited subset of NL. Note that despite the obvious overlap of research in CL and AI (since language proficiency is an intellectual function), AI does not absorb all of CL, since CL has its own theoretical basis and methodology. What these sciences have in common is computer modeling as the main method and final goal of research.

Thus, the task of CL can be formulated as the development of computer programs for the automatic processing of texts in NL. And although processing is understood quite broadly, not all types of processing can be called linguistic, and not all processors can be called linguistic processors. A linguistic processor must use one or another formal model of language (even a very simple one), which means it must be language-dependent in one way or another (i.e., depend on a specific NL). Thus, for example, the Microsoft Word text editor can be called linguistic (if only because it uses dictionaries), but the Notepad editor cannot.

The complexity of CL tasks is due to the fact that NL is a complex multi-level system of signs that arose for the exchange of information between people, developed in the process of human practical activity, and constantly changing in connection with this activity. Another difficulty in developing CL methods (and the difficulty of studying NL within the framework of linguistics) is associated with the diversity of natural languages, significant differences in their vocabulary, morphology, syntax; different languages ​​provide different ways of expressing the same meaning.

2. Features of the NL system: levels and connections

The object of linguistic processors is NL texts. Texts are understood as any samples of speech – oral and written, of any genre – but CL mainly considers written texts. A text has a one-dimensional, linear structure and also carries a certain meaning, while language acts as a means of transforming the transmitted meaning into texts (speech synthesis) and vice versa (speech analysis). A text is composed of smaller units, and there are several possible ways of dividing (segmenting) it into units belonging to different levels.

The existence of the following levels is generally accepted:

· the level of sentences (utterances) – the syntactic level;

· Lexico-morphological homonymy (the most common type) occurs when the word forms of two different lexemes coincide – for example, when the same form is both a verb in the masculine singular and a noun in the nominative singular;

· Syntactic homonymy means ambiguity of the syntactic structure, which leads to several interpretations: Students from Lvov went to Kyiv; Flying planes can be dangerous (a famous example of Chomsky's); etc.

3. Modeling in computational linguistics

The development of a linguistic processor (LP) involves a description of the linguistic properties of the processed NL text, and this description is organized as a model of the language. As with modeling in mathematics and programming, a model is understood as a system that reflects a number of essential properties of the phenomenon being modeled (i.e., of NL) and therefore has a structural or functional similarity to it.

Language models used in CL are usually built on the basis of theories created by linguists through the study of various texts and on the basis of their linguistic intuition (introspection). What is specific to CL models? The following features can be distinguished:

· Formality and, ultimately, algorithmizability;

· Functionality (the purpose of modeling is to reproduce the functions of language as a “black box”, without building an accurate model of human speech synthesis and analysis);

· The generality of the model, i.e., it takes into account a fairly large set of texts;

· Experimental validity, which involves testing the model on different texts;

· Reliance on dictionaries as an obligatory component of the model.

The complexity of the NL, its description and processing leads to the division of this process into separate stages corresponding to the levels of the language. Most modern LPs are of the modular type, in which each level of linguistic analysis or synthesis corresponds to a separate processor module. In particular, in the case of text analysis, individual LP modules perform:

Ø Graphematic analysis, i.e. identifying word forms in the text (the transition from characters to words);

Ø Morphological analysis – the transition from word forms to their lemmas (the dictionary forms of lexemes) or stems (the core parts of words, minus inflectional morphemes);

Ø Syntactic analysis, i.e. identifying the grammatical structure of text sentences;

Ø Semantic and pragmatic analysis, which determines the meaning of phrases and the corresponding reaction of the system within which the LP operates.

Different schemes for the interaction of these modules are possible (sequential operation or parallel intermittent analysis), however, individual levels - morphology, syntax and semantics are still processed by different mechanisms.

Thus, LP can be considered as a multi-stage converter, which, in the case of text analysis, translates each of its sentences into an internal representation of its meaning and vice versa in the case of synthesis. The corresponding language model can be called structural.

Although complete CL models require taking into account all the main levels of the language and the presence of corresponding modules, when solving some applied problems it is possible to do without representing individual levels in LP. For example, in early experimental CL programs, the processed texts belonged to very narrow problem areas (with a limited set of words and their strict order), so that their initial letters could be used to recognize words, omitting the stages of morphological and syntactic analysis.

Another example of a reduced model, now used quite often, is a language model of the frequencies of characters and their combinations (bigrams, trigrams, etc.) in the texts of a particular NL. Such a statistical model captures linguistic information at the level of the characters (letters) of a text, and it is sufficient, for example, for detecting typos or for identifying the language of a text. A similar model based on the statistics of individual words and their co-occurrence in texts (word bigrams and trigrams) is used, for example, to resolve lexical ambiguity or to determine the part of speech of a word (in languages such as English).
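As an illustration of such a character-level statistical model, here is a minimal sketch of trigram-based language identification in Python (not from the original text; the training samples and language labels are invented for illustration). It uses the classic "out-of-place" rank distance between trigram profiles.

```python
# A minimal sketch of a character-trigram language model used for language
# identification: the language whose trigram profile best matches the input is chosen.
from collections import Counter

def trigram_profile(text: str, top_n: int = 300) -> list:
    """Return the most frequent character trigrams of a text, most frequent first."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top_n)]

def out_of_place_distance(profile_a: list, profile_b: list) -> int:
    """Classic 'out-of-place' rank distance between two trigram profiles."""
    rank_b = {g: r for r, g in enumerate(profile_b)}
    penalty = len(profile_b)  # cost for trigrams missing from the other profile
    return sum(abs(r - rank_b.get(g, penalty)) for r, g in enumerate(profile_a))

def identify_language(text: str, training_samples: dict) -> str:
    """Pick the training language whose trigram profile is closest to the text."""
    text_profile = trigram_profile(text)
    return min(training_samples,
               key=lambda lang: out_of_place_distance(
                   text_profile, trigram_profile(training_samples[lang])))

# Toy usage with tiny placeholder samples (real systems train on large corpora):
samples = {"en": "the quick brown fox jumps over the lazy dog and runs away",
           "de": "der schnelle braune fuchs springt über den faulen hund und läuft weg"}
print(identify_language("the dog runs over the fox", samples))  # expected: "en"
```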

Note that structural-statistical models are also possible, in which the representation of individual levels of NL is supplemented with one or another kind of statistics – of words, syntactic structures, etc.

In a modular-type LP, at each stage of text analysis or synthesis, a corresponding model (morphology, syntax, etc.) is used.

The morphological models for analyzing word forms existing in CL differ mainly in the following parameters:

· the result of the work - a lemma or stem with a set of morphological characteristics (gender, number, case, aspect, person, etc.) of a given word form;

· the method of analysis – based on a dictionary of the word forms of the language, on a dictionary of stems, or on a dictionary-free method;

· the ability to process the word form of a lexeme not included in the dictionary.

In morphological synthesis, the initial data are the lexeme and specific morphological characteristics of the requested word form of this lexeme; a request for the synthesis of all forms of a given lexeme is also possible. The result of both morphological analysis and synthesis is generally ambiguous.
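The following is a minimal sketch, under invented data, of the dictionary-of-word-forms approach mentioned above: analysis maps a word form to all of its (lemma, tags) readings, and synthesis maps a lemma plus requested characteristics back to word forms; as noted, both results are in general ambiguous. The tiny dictionary and tag names are illustrative only.

```python
# A minimal sketch of dictionary-based morphological analysis and synthesis.
# The word-form dictionary below is invented; real systems store hundreds of
# thousands of forms (or stems plus inflection tables).
WORD_FORMS = {
    # form: list of (lemma, morphological tags) - a form may be ambiguous
    "books": [("book", {"pos": "NOUN", "number": "plural"}),
              ("book", {"pos": "VERB", "person": "3", "number": "singular"})],
    "went":  [("go",   {"pos": "VERB", "tense": "past"})],
}

def analyze(form: str):
    """Morphological analysis: word form -> all (lemma, tags) readings."""
    return WORD_FORMS.get(form.lower(), [])

def synthesize(lemma: str, **required_tags):
    """Morphological synthesis: lemma + tags -> all matching word forms."""
    return [form for form, readings in WORD_FORMS.items()
            for lem, tags in readings
            if lem == lemma and all(tags.get(k) == v for k, v in required_tags.items())]

print(analyze("books"))                # ambiguous: noun and verb readings
print(synthesize("book", pos="NOUN"))  # -> ['books']
```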

To model syntax within the framework of CL, a large number of different ideas and methods have been proposed, differing in the way of describing the syntax of the language, the way of using this information in the analysis or synthesis of a NL sentence, as well as the way of representing the syntactic structure of the sentence. Quite conventionally, we can distinguish three main approaches to creating models: a generative approach, going back to the ideas of Chomsky, an approach going back to the ideas of I. Melchuk and represented by the “Meaning-Text” model, as well as an approach within which certain attempts are made to overcome the limitations of the first two approaches, in particular, the theory of syntactic groups.

Within the generative approach, syntactic analysis is usually performed on the basis of a formal context-free grammar describing the phrase structure of a sentence, or on the basis of some extension of context-free grammars. These grammars rely on the successive linear division of a sentence into phrases (syntactic constructions, for example noun phrases) and therefore simultaneously reflect both its syntactic and its linear structure. The hierarchical syntactic structure of an NL sentence obtained as a result of the analysis is described by a constituency tree (tree of components), whose leaves contain the words of the sentence, whose subtrees correspond to the syntactic constructions (phrases) included in the sentence, and whose arcs express the nesting relations of the constructions.
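As a sketch of this idea (assuming an invented toy grammar and lexicon rather than any real CL grammar), the following Python fragment parses a short sentence with a context-free grammar and prints its constituency tree as nested tuples.

```python
# A minimal sketch of constituency parsing with a toy context-free grammar.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"Det": {"the", "a"}, "N": {"students", "book"}, "V": {"read"}}

def parse(symbol, words, start):
    """Yield (tree, next_position) for every way `symbol` can cover words[start:]."""
    if symbol in LEXICON:                              # preterminal: match one word
        if start < len(words) and words[start] in LEXICON[symbol]:
            yield (symbol, words[start]), start + 1
        return
    for rule in GRAMMAR.get(symbol, []):               # nonterminal: try each rule
        def expand(children, pos, remaining):
            if not remaining:
                yield (symbol, *children), pos
                return
            for child, nxt in parse(remaining[0], words, pos):
                yield from expand(children + [child], nxt, remaining[1:])
        yield from expand([], start, rule)

sentence = "the students read a book".split()
for tree, end in parse("S", sentence, 0):
    if end == len(sentence):                           # complete constituency tree
        print(tree)
# ('S', ('NP', ('Det', 'the'), ('N', 'students')),
#       ('VP', ('V', 'read'), ('NP', ('Det', 'a'), ('N', 'book'))))
```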

The approach under consideration also includes network grammars, which serve both as an apparatus for describing the language system and as a way of specifying a procedure for analyzing sentences based on the concept of a finite-state machine – for example, the augmented transition network (ATN).

Within the second approach, a more intuitive and widespread way of representing the syntactic structure of a sentence is used – dependency trees. The nodes of the tree contain the words of the sentence (the root is usually the predicate verb), and each arc of the tree connecting a pair of nodes is interpreted as a subordinating syntactic link between them, with the direction of the link corresponding to the direction of the arc. Since in this case the syntactic links between words are separated from word order, dependency trees make it possible to describe discontinuous and non-projective constructions, which appear quite often in languages with free word order.

Constituency trees are better suited to describing languages with rigid word order; representing discontinuous and non-projective constructions with them requires extending the grammatical formalism used. On the other hand, within this approach constructions with non-subordinating relations are described more naturally. A common difficulty for both approaches is the representation of homogeneous (coordinated) members of a sentence.

Syntactic models in all approaches try to take into account the restrictions imposed on the combination of linguistic units in speech, and in one way or another they use the concept of valence. Valence is the ability of a word or other unit of language to attach other units in a certain syntactic way; an actant is a word or syntactic construction that fills this valence. For example, the verb to hand over has three main valences, which can be expressed by the interrogative words who? to whom? what? Within the generative approach, the valences of words (primarily verbs) are described mainly in the form of special frames (subcategorization frames), and within the approach based on dependency trees – as control (government) models.
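A minimal sketch of how such a valence frame might be stored and checked against a dependency structure is shown below (the verb key, relation names, and sentence are invented for illustration; real control models also record morphological and semantic constraints on each actant).

```python
# A minimal sketch of verb valences (subcategorization / control models) applied
# to a dependency structure; the frame mirrors the who? / to whom? / what? example.
SUBCAT_FRAMES = {
    "hand_over": {"subject": "who?", "indirect_object": "to whom?", "object": "what?"},
}

# A dependency tree as (head, relation, dependent) arcs; the root is the predicate verb.
sentence_arcs = [
    ("hand_over", "subject", "teacher"),
    ("hand_over", "object", "book"),
    ("hand_over", "indirect_object", "student"),
]

def unfilled_valences(verb: str, arcs) -> set:
    """Return the valences of `verb` that no actant in the dependency tree fills."""
    filled = {rel for head, rel, dep in arcs if head == verb}
    return set(SUBCAT_FRAMES.get(verb, {})) - filled

print(unfilled_valences("hand_over", sentence_arcs))      # set() -> all valences filled
print(unfilled_valences("hand_over", sentence_arcs[:2]))  # {'indirect_object'}
```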

Models of language semantics are the least developed within CL. For the semantic analysis of sentences, so-called case grammars and semantic cases (valences) are used, on the basis of which the semantics of a sentence is described through the links of the main word (the verb) with its semantic actants, i.e., through semantic cases. For example, the verb to hand over is described by the semantic cases of the giver (agent), the addressee, and the object of transfer.

To represent the semantics of an entire text, two logically equivalent formalisms are commonly used (both of which are described in detail within the AI ​​framework):

· Predicate calculus formulas expressing properties, states, processes, actions, and relations;

· Semantic networks – labeled graphs in which vertices correspond to concepts and arcs to the relations between them.

As for models of pragmatics and discourse, which make it possible to process not only individual sentences but the text as a whole, the ideas of van Dijk are mainly used to construct them. One of the rare successful models is the model of discourse synthesis of coherent texts. Such models must take into account anaphoric references and other discourse-level phenomena.

Concluding the characterization of language models within CL, let us dwell in a little more detail on the "Meaning-Text" theory of linguistic models, within the framework of which many fruitful ideas appeared that were ahead of their time and are still relevant today.

In accordance with this theory, NL is considered as a special kind of transformer that converts given meanings into corresponding texts and given texts into corresponding meanings. Meaning is understood as the invariant of all synonymous transformations of a text. The content of a coherent fragment of speech, without division into phrases and word forms, is represented as a special semantic representation consisting of two components: a semantic graph and information about the communicative organization of the meaning.

The following distinctive features of the theory should be noted:

o orientation towards the synthesis of texts (the ability to generate correct texts is considered as the main criterion of linguistic competence);

o the multi-level, modular nature of the model, with the main levels of language divided into surface and deep levels: for example, deep (semanticized) and surface ("pure") syntax are distinguished, as well as surface-morphological and deep-morphological levels;

o the integral nature of the language model: the information represented at each level is handled by the corresponding module, which performs the transition from that level to the next;

o special means for describing syntactics (the rules for combining units) at each level; to describe lexical compatibility, a set of lexical functions was proposed, with the help of which rules of syntactic paraphrasing are formulated;

o an emphasis on the vocabulary rather than on the grammar; the dictionary stores information related to different levels of language; in particular, syntactic analysis relies on word control models that describe the syntactic and semantic valences of words.

This theory and language model are embodied in the ETAP machine translation system.

4. Linguistic resources

The development of linguistic processors requires an appropriate representation of linguistic information about the processed language. This information is displayed in a variety of computer dictionaries and grammars.

Dictionaries are the most traditional form of representing lexical information; they differ in their units (usually words or phrases), their structure, and their coverage of the vocabulary (dictionaries of terms in a specific problem area, dictionaries of general vocabulary, etc.). The dictionary unit is called a dictionary entry; it provides information about a lexeme. Lexical homonyms are usually represented in different dictionary entries.

The most common dictionaries in CL are morphological dictionaries used for morphological analysis; their dictionary entries present morphological information about the corresponding word – part of speech, inflectional class (for inflectional languages), a list of word meanings, etc. Depending on the organization of the linguistic processor, grammatical information, for example word control models, can also be added to the dictionary.

There are dictionaries that provide broader information about words. For example, the "Meaning-Text" linguistic model relies heavily on an explanatory combinatorial dictionary, whose dictionary entries present, in addition to morphological, syntactic, and semantic information (syntactic and semantic valences), information about the lexical compatibility of the word.

A number of linguistic processors use dictionaries of synonyms. A relatively new type of dictionary is the dictionary of paronyms, i.e., of outwardly similar words that differ in meaning, for example, stranger and alien, editing and reference.

Another type of lexical resource is the phrase database, which collects the most typical phrases of a particular language. Such a database of Russian phrases (about a million units) forms the core of the CrossLexica system.

More complex types of lexical resources are thesauri and ontologies. A thesaurus is a semantic dictionary, i.e., a dictionary that presents the semantic relations between words – synonymy, genus-species relations (sometimes called the above-below relation), part-whole relations, and associations. The spread of thesauri is connected with the solution of information retrieval problems.

Closely related to the concept of a thesaurus is the concept of an ontology. An ontology is a set of concepts and entities of a certain field of knowledge, oriented towards reuse for various tasks. Ontologies can be created on the basis of the existing vocabulary of a language – in this case they are called linguistic ontologies.

The WordNet system is considered such a linguistic ontology – a large lexical resource that contains English nouns, adjectives, verbs, and adverbs and presents their semantic relations of several types. For each of these parts of speech, the words are grouped into sets of synonyms (synsets), between which the relations of antonymy, hyponymy (the genus-species relation), and meronymy (the part-whole relation) are established. The resource contains approximately 25 thousand words; the number of hierarchy levels for the genus-species relation averages 6-7, sometimes reaching 15. The top level of the hierarchy forms a general ontology – a system of basic concepts about the world.
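For example, a hedged sketch of querying WordNet-style relations through the NLTK interface (assuming nltk is installed and the WordNet data has been fetched once with nltk.download("wordnet"); the word chosen is arbitrary):

```python
# A sketch of navigating WordNet relations (synonymy, hypernymy, meronymy) via NLTK.
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog", pos=wn.NOUN)[0]                    # first synset (group of synonyms)
print(dog.lemma_names())                                   # synonyms in the synset
print([s.lemma_names()[0] for s in dog.hypernyms()])       # genus (hypernym) concepts
print([s.lemma_names()[0] for s in dog.part_meronyms()])   # part-whole relations

# Walk up the genus-species hierarchy to the top-level ontology concepts:
chain = [dog]
while chain[-1].hypernyms():
    chain.append(chain[-1].hypernyms()[0])
print(" -> ".join(s.lemma_names()[0] for s in chain))
```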

Based on the English WordNet scheme, similar lexical resources for other European languages ​​were built, united under the general name EuroWordNet.

A completely different type of linguistic resource is an NL grammar, whose form depends on the syntax model used in the processor. To a first approximation, a grammar is a set of rules expressing the general syntactic properties of words and groups of words. The total number of grammar rules also depends on the syntax model, varying from a few dozen to several hundred. Essentially, the issue of the trade-off between grammar and vocabulary in a language model arises here: the more information is presented in the dictionary, the shorter the grammar can be, and vice versa.

Note that the construction of computer dictionaries, thesauruses and grammars is a voluminous and labor-intensive work, sometimes even more labor-intensive than the development of a linguistic model and the corresponding processor. Therefore, one of the subordinate tasks of CL is the automation of the construction of linguistic resources.

Computer dictionaries are often formed by converting ordinary text dictionaries, but their construction often requires much more complex and painstaking work. This usually happens when constructing dictionaries and thesauri for rapidly developing scientific fields - molecular biology, computer science, etc. The source material for extracting the necessary linguistic information can be collections and text corpora.

A corpus of texts is a collection of texts assembled according to a certain principle of representativeness (by genre, authorship, etc.), in which all texts are marked up, that is, provided with linguistic annotation – morphological, accentual, syntactic, etc. Currently there are at least a hundred different corpora – for different languages and with different annotation; in Russia, the best known is the National Corpus of the Russian Language.

Labeled corpora are created by linguists and are used both for linguistic research and for tuning (training) models and processors used in CL using well-known mathematical methods of machine learning. Thus, machine learning is used to configure methods for resolving lexical ambiguity, recognizing parts of speech, and resolving anaphoric references.

Since corpora and collections of texts are always limited in terms of the linguistic phenomena represented in them (and corpora, among other things, take quite a long time to create), recently Internet texts are increasingly being considered as a more complete linguistic resource. Of course, the Internet is the most representative source of modern speech samples, but its use as a corpus requires the development of special technologies.

5. Applications of computational linguistics

The field of applications of computational linguistics is constantly expanding, so we will characterize here the most well-known applied problems solved by its tools.

Machine translation is the earliest application of CL, together with which this field itself arose and developed. The first translation programs were built more than 50 years ago and were based on a simple word-by-word translation strategy. However, it was quickly realized that machine translation requires a complete linguistic model that takes into account all levels of language, right down to semantics and pragmatics, and this has repeatedly hampered the development of the area. A fairly complete model is used in the domestic ETAP system, which translates scientific texts from French into Russian.

Note, however, that in the case of translation into a related language, for example, when translating from Spanish to Portuguese or from Russian to Ukrainian (which have much in common in syntax and morphology), the processor can be implemented based on a simplified model, for example, based on using the same word-by-word translation strategy.

Currently, there is a whole range of computer translation systems (of varying quality), from large international research projects to commercial automatic translators. Of significant interest are multilingual translation projects using an intermediate language in which the meaning of the translated phrases is encoded. Another modern direction is statistical translation, based on statistics on the translation of words and phrases (these ideas, for example, are implemented in the Google search engine translator).

But despite many decades of development in this entire area, in general the problem of machine translation is still very far from being completely solved.

Another fairly old application of computational linguistics is information retrieval and related tasks of indexing, abstracting, classification and rubrication of documents.

Full-text search of documents in large databases (primarily of scientific, technical, and business documents) is usually carried out on the basis of their search images, by which we mean a set of keywords – words reflecting the main topic of the document. At first, only individual words of NL were considered as keywords, and the search was carried out without taking their inflection into account, which is not critical for weakly inflected languages such as English. For inflected languages, for example Russian, it was necessary to use a morphological model that takes inflection into account.

The search query was also presented as a set of words; suitable (relevant) documents were determined based on the similarity of the query and the search image of the document. Creating a search image of a document involves indexing its text, i.e. highlighting key words in it. Since very often the topic and content of a document are reflected much more accurately not by individual words, but by phrases, phrases began to be considered as keywords. This significantly complicated the procedure for indexing documents, since it was necessary to use various combinations of statistical and linguistic criteria to select significant phrases in the text.

In fact, information retrieval mainly uses the vector text model (sometimes called the bag-of-words model), in which a document is represented as a vector (set) of its keywords. Modern Internet search engines also use this model, indexing texts by the words used in them (and they apply very sophisticated ranking procedures to return relevant documents).
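A minimal sketch of this vector (bag-of-words) model, with invented toy documents and a toy query, might look as follows: documents and the query are turned into term-frequency vectors and ranked by cosine similarity.

```python
# A minimal sketch of the vector (bag-of-words) text model with cosine-similarity ranking.
import math
from collections import Counter

def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = ["machine translation of scientific texts",
             "statistical methods in information retrieval",
             "speech recognition and speech synthesis"]
query = bag_of_words("statistical information retrieval")

# Rank documents by similarity of their search image to the query:
ranked = sorted(documents, key=lambda d: cosine(bag_of_words(d), query), reverse=True)
print(ranked[0])   # -> "statistical methods in information retrieval"
```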

The specified text model (with some complications) is also used in related information retrieval problems discussed below.

Text summarization – reducing the volume of a text and obtaining a summary of it, an abstract (condensed content) – makes searching in document collections faster. A common abstract can also be compiled for several documents on a related topic.

The main method of automatic abstracting is still the selection of the most significant sentences of the text being abstracted: the keywords of the text are usually computed first, and a significance coefficient is then calculated for each sentence. The selection of significant sentences is complicated by anaphoric links between sentences, which it is undesirable to break; certain sentence-selection strategies are being developed to solve this problem.
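The following is a minimal sketch of this keyword-frequency approach to extractive summarization (the stop-word list, tokenization, and scoring are deliberately simplified and do not handle the anaphora problem mentioned above):

```python
# A minimal sketch of extractive summarization: score each sentence by the frequency
# of the text's keywords it contains and keep the top-scoring sentences in original order.
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "is", "to", "it"}  # illustrative stop list

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = [w.strip(".,") for w in text.lower().split() if w.strip(".,") not in STOPWORDS]
    keyword_freq = Counter(words)

    def score(sentence: str) -> int:
        return sum(keyword_freq[w.strip(".,")] for w in sentence.lower().split())

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return ". ".join(s for s in sentences if s in top) + "."

text = ("Computational linguistics develops programs for processing texts. "
        "Morphological analysis maps word forms to lemmas. "
        "Weather was fine yesterday. "
        "Syntactic analysis builds the structure of sentences in texts.")
print(summarize(text))   # keeps the two most keyword-rich sentences
```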

A task close to abstracting is text annotation, i.e., drawing up an annotation of a document. In its simplest form, an annotation is a list of the main topics of the text, which indexing procedures can be used to identify.

When creating large collections of documents, the tasks of classification and clustering of texts become relevant, with the goal of forming classes of thematically related documents. Classification means assigning each document to a specific class with previously known parameters, while clustering means dividing a set of documents into clusters, i.e., subsets of thematically similar documents. To solve these problems, machine learning methods are used, and therefore these applied problems are referred to as Text Mining and belong to the scientific direction known as Data Mining.
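As a hedged illustration (assuming the scikit-learn library and a tiny invented training set; a real system would be trained on a labeled corpus), classification and clustering of short texts might look like this:

```python
# A sketch of document classification and clustering over bag-of-words (TF-IDF) vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

train_texts = ["machine translation of texts", "statistical machine translation systems",
               "protein structure and molecular biology", "gene expression in cells"]
train_labels = ["linguistics", "linguistics", "biology", "biology"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)           # bag-of-words / TF-IDF vectors

classifier = MultinomialNB().fit(X, train_labels)   # supervised classification
print(classifier.predict(vectorizer.transform(["neural machine translation"])))

clusterer = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised clustering
print(clusterer.labels_)                            # cluster index for each document
```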

Very close to classification is the problem of text rubrication – assigning a text to one of previously known thematic rubrics (usually the rubrics form a hierarchical tree of topics).

The classification problem is becoming increasingly widespread; it is solved, for example, in spam recognition, and a relatively new application is the classification of SMS messages in mobile devices. A new and relevant direction of research for the general problem of information retrieval is multilingual document search.

Another relatively new task related to information retrieval is question answering (Question Answering). It is solved by determining the type of the question, searching for texts that potentially contain the answer to this question, and extracting the answer from those texts.

A completely different applied direction, developing slowly but steadily, is the automation of the preparation and editing of texts in NL. Among the first applications in this direction were programs for automatic word hyphenation and spelling programs (spellers, or auto-correctors). Despite the apparent simplicity of the hyphenation problem, its correct solution for many languages (for example, English) requires knowledge of the morphemic structure of the words of the corresponding language, and therefore a corresponding dictionary.

Spell checking has long been implemented in commercial systems and relies on an appropriate dictionary and morphology model. An incomplete syntax model is also used, on the basis of which sufficiently frequent syntactic errors (for example, word agreement errors) are detected. At the same time, auto-correctors do not yet detect more complex errors, for example, the incorrect use of prepositions. Many lexical errors are also not detected, in particular errors resulting from typos or from the incorrect use of similar words (for example, weight instead of weighty). Modern CL research proposes methods for the automated detection and correction of such errors, as well as of some other types of stylistic errors. These methods use statistics on the occurrence of words and phrases.
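A minimal sketch of the dictionary-based part of such a speller, with an invented word list, is shown below: an unknown word form is replaced by the closest dictionary word by edit (Levenshtein) distance; real systems additionally use morphology and word/phrase statistics.

```python
# A minimal sketch of dictionary-based spelling correction via Levenshtein distance.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

DICTIONARY = {"weight", "weighty", "word", "world", "work"}  # illustrative word list

def correct(word: str) -> str:
    if word in DICTIONARY:
        return word
    return min(DICTIONARY, key=lambda w: edit_distance(word, w))

print(correct("worls"))   # -> "world"
```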

An applied task close to supporting the preparation of texts is the teaching of natural languages; within this direction, computer systems for teaching languages – English, Russian, etc. – are being developed (similar systems can be found on the Internet). Typically, these systems support the study of individual aspects of language (morphology, vocabulary, syntax) and are based on the corresponding models, for example, a morphology model.

As for learning vocabulary, electronic analogues of conventional text dictionaries (which essentially contain no language models) are also used for this purpose. However, multifunctional computer dictionaries that have no textual analogues and are aimed at a wide range of users are also being developed – for example, the CrossLexica dictionary of Russian phrases. This system covers a wide range of vocabulary – words and their admissible word combinations – and also provides help on word control models, synonyms, antonyms, and other semantic correlates of words, which is clearly useful not only for those who study Russian, but also for native speakers.

The next application area worth mentioning is the automatic generation of texts in NL. In principle, this task can be considered a subtask of the machine translation task discussed above; however, the direction has a number of specific tasks of its own. One such task is multilingual generation, i.e., the automatic construction of special documents in several languages – patent claims, operating instructions for technical products or software systems – based on their specifications in a formal language. Fairly detailed language models are used to solve this problem.

An increasingly relevant applied problem, often referred to as Text Mining, is the extraction of information from texts (Information Extraction), which is required when solving problems of economic and production analytics. For this purpose, certain objects are identified in the NL text – named entities (proper names, personalities, geographical names), their relations, and the events associated with them. As a rule, this is implemented on the basis of partial parsing of the text, which makes it possible to process news streams from news agencies. Since the task is quite complex not only theoretically but also technologically, the creation of significant systems for extracting information from texts is feasible mainly within commercial companies.
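A very rough sketch of the simplest form of such extraction, using an invented gazetteer and a surface pattern instead of partial parsing, is shown below; industrial systems use far richer grammars and machine-learned models.

```python
# A minimal sketch of named-entity extraction with a gazetteer and a surface pattern.
import re

GAZETTEER = {"Kyiv": "LOCATION", "Lvov": "LOCATION", "Google": "ORGANIZATION"}
PERSON_PATTERN = re.compile(r"\b(?:Mr\.|Dr\.|Prof\.)\s+[A-Z][a-z]+\b")

def extract_entities(text: str):
    entities = [(m.group(), "PERSON") for m in PERSON_PATTERN.finditer(text)]
    entities += [(name, label) for name, label in GAZETTEER.items() if name in text]
    return entities

print(extract_entities("Prof. Chomsky visited Kyiv at the invitation of Google."))
# [('Prof. Chomsky', 'PERSON'), ('Kyiv', 'LOCATION'), ('Google', 'ORGANIZATION')]
```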

The field of Text Mining also includes two other related tasks – opinion mining (Opinion Mining) and sentiment analysis (Sentiment Analysis), which are attracting the attention of a growing number of researchers. The first task involves searching for user opinions about products and other objects (in blogs, forums, online stores, etc.) and analyzing these opinions. The second task is close to the classical task of content analysis of mass-communication texts; it evaluates the overall tone of statements.

Another application worth mentioning is supporting a dialogue with the user in NL within the framework of an information software system. Most often, this problem has been solved for specialized databases – in this case the query language is quite limited (lexically and grammatically), which allows the use of simplified language models. Queries to the database, formulated in NL, are translated into a formal language, after which the required information is searched for and the corresponding response phrase is constructed.

As the last (but not least important) in our list of CL applications, we mention speech recognition and synthesis. Recognition errors that inevitably arise in these tasks are corrected by automatic methods based on dictionaries and linguistic knowledge of morphology. Machine learning is also used in this area.

Conclusion

Computational linguistics demonstrates quite tangible results in various applications for the automatic processing of texts in NL. Its further development depends both on the emergence of new applications and on the independent development of various language models, in which many problems remain unsolved. The most developed are the models of morphological analysis and synthesis. Syntax models have not yet been brought to the level of stable and efficiently working modules, despite the large number of proposed formalisms and methods. Models at the level of semantics and pragmatics are even less studied and formalized, although automatic processing of discourse is already required in a number of applications. Note that the already existing tools of computational linguistics itself, together with machine learning and text corpora, can significantly advance the solution of these problems.

Literature

1. Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval, Addison-Wesley, 1999.

2. Bateman, J., Zock M. Natural Language Generation. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p.304.

3. Biber, D., Conrad S., and Reppen D. Corpus Linguistics. Investigating Language Structure and Use. Cambridge University Press, Cambridge, 1998.

4. Bolshakov, I. A., Gelbukh, A. Computational Linguistics. Models, Resources, Applications. Mexico, IPN, 2004.

5. Brown, P., Della Pietra, S., Della Pietra, V., Mercer, R. The Mathematics of Statistical Machine Translation. // Computational Linguistics, Vol. 19(2), 1993, p. 263-311.

6. Carroll J R. Parsing. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 233-248.

7. Chomsky, N. Syntactic Structures. The Hague: Mouton, 1957.

8. Grishman R. Information extraction. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 545-559.

9. Harabagiu, S., Moldovan D. Question Answering. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 560-582.

10. Hearst, M. A. Automated Discovery of WordNet Relations. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, 1998, p.131-151.

11. Hirst, G. Ontology and the Lexicon. In.: Handbook on Ontologies in Information Systems. Berlin, Springer, 2003.

12. Jacquemin C., Bourigault D. Term extraction and automatic indexing // Mitkov R. (ed.): Handbook of Computational Linguistics. Oxford University Press, 2003. p. 599-615.

13. Kilgarriff, A., Grefenstette, G. Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, V. 29, No. 3, 2003, p. 333-347.

14. Manning, Ch. D., H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

15. Matsumoto Y. Lexical Knowledge Acquisition. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 395-413.

16. The Oxford Handbook on Computational Linguistics. R. Mitkov (Ed.). Oxford University Press, 2005.

17. Oakes, M., Paice C. D. Term extraction for automatic abstracting. Recent Advances in Computational Terminology. D. Bourigault, C. Jacquemin and M. L'Homme (Eds), John Benjamins Publishing Company, Amsterdam, 2001, p.353-370.

18. Pedersen, T. A decision tree of bigrams is an accurate predictor of word senses. Proc. 2nd Meeting of the North American Chapter of the ACL (NAACL), Pittsburgh, PA, 2001, p. 79-86.

19. Samuelsson C. Statistical Methods. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 358-375.

20. Salton, G. Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA: Addison-Wesley, 1988.

21. Somers, H. Machine Translation: Latest Developments. In: The Oxford Handbook of Computational Linguistics. Mitkov R. (ed.). Oxford University Press, 2003, p. 512-528.

22. Strzalkowski, T. (ed.) Natural Language Information Retrieval. Kluwer, 1999.

23. Woods, W. A. Transition Network Grammars for Natural Language Analysis. // Communications of the ACM, V. 13, 1970, N 10, p. 591-606.

24. WordNet: An Electronic Lexical Database. / Fellbaum, C. (ed.). Cambridge, MIT Press, 1998.

25. Wu J., Yu-Chia Chang Y., Teruko Mitamura T., Chang J. Automatic Collocation Suggestion in Academic Writing // Proceedings of the ACL 2010 Conference Short Papers, 2010.

26. Apresyan, Yu. D., et al. Linguistic support of the ETAP-2 system. M.: Nauka, 1989.

27. Data analysis technologies: Data Mining, Visual Mining, Text Mining, OLAP – 2nd ed. – St. Petersburg: BHV-Petersburg, 2008.

28. Bolshakov, I. A. CrossLexica – a large electronic dictionary of combinations and semantic connections of Russian words. // Comp. linguistics and intellectual technologies: Proceedings of the int. conf. "Dialogue 2009". M.: Russian State University for the Humanities, 2009, pp. 45-50.

29. Bolshakova, E. I., Bolshakov, I. A. Detection and automated correction of Russian malapropisms // NTI. Ser. 2, No. 5, 2007, pp. 27-40.

30. Van Dijk, T., Kintsch, W. Strategies of comprehension of connected text. // New in foreign linguistics. Vol. XXIII. – M., Progress, 1988, p. 153-211.

31. Vasiliev V. G., Krivenko M. P. Methods of automated text processing. – M.: IPI RAS, 2008.

32. Winograd, T. A program that understands natural language. - M., Mir, 1976.

33. Gladky, A. V. Natural language structures in automated communication systems. – M., Nauka, 1985.

34. Gusev, V. D., Salomatina, N. V. Electronic dictionary of paronyms: version 2. // NTI, Ser. 2, No. 7, 2001, p. 26-33.

35. Zakharov, V. P. Web space as a language corpus // Computer linguistics and intellectual technologies: Proceedings of the International Conference Dialogue '2005. – M.: Nauka, 2005, p. 166-171.

36. Kasevich, V. B. Elements of general linguistics. - M., Nauka, 1977.

37. Leontyeva, N. N. Automatic understanding of texts: Systems, models, resources: Textbook. - M.: Academia, 2006.

38. Linguistic encyclopedic dictionary / Ed. V. N. Yartseva, M.: Soviet Encyclopedia, 1990, 685 p.

39. Lukashevich, N. V., Salii, A. D. Thesaurus for automatic indexing and categorization: development, structure, maintenance. // NTI, Ser. 2, No. 1, 1996.

40. Luger J. Artificial intelligence: strategies and methods for solving complex problems. M., 2005.

41. McKeown, K. Discourse strategies for text synthesis in natural language // New in foreign linguistics. Vol. XXIV. M.: Progress, 1989, pp. 311-356.

42. Melchuk, I. A. Outline of a theory of "Meaning-Text" linguistic models. - M., Nauka, 1974.

43. National Corpus of the Russian Language. http://*****

44. Khoroshevsky V. F. OntosMiner: a family of systems for extracting information from multilingual collections of documents // Ninth National Conference on Artificial Intelligence with International Participation KII-2004. T. 2. – M.: Fizmatlit, 2004, p.573-581.


History of the development of computational linguistics

The process of the emergence and formation of modern linguistics as a science of natural language represents a long historical development of linguistic knowledge. Linguistic knowledge is based on elements that were formed in the course of activities inextricably linked with the development of the structure of oral speech, with the emergence, further development, and improvement of writing, with the teaching of writing, and with the interpretation and decoding of texts.

Natural language as an object of linguistics occupies a central place in this science. In the course of the development of language, ideas about it also changed. If previously no special importance was attached to the internal organization of language, and it was considered primarily in the context of its relationship with the outside world, then, starting from the late 19th and early 20th centuries, a special role came to be assigned to the internal formal structure of language. It was during this period that the famous Swiss linguist Ferdinand de Saussure developed the foundations of such sciences as semiology and structural linguistics, which were set out in detail in his book "A Course in General Linguistics" (1916).

The scientist put forward the idea of considering language as a single mechanism, an integral system of signs, which in turn makes it possible to describe language mathematically. Saussure was the first to propose a structural approach to language, namely, describing language by studying the relations between its units. By a unit, or "sign," he understood a word that combines both meaning and sound. The concept proposed by the Swiss scientist is based on the theory of language as a system of signs consisting of three parts: language (French langue), speech (French parole), and speech activity (French langage).

The scientist himself defined the science he created, semiology, as “a science that studies the life of signs within the framework of the life of society.” Since language is a sign system, in search of an answer to the question of what place linguistics occupies among other sciences, Saussure argued that linguistics is part of semiology. It is generally accepted that it was the Swiss philologist who laid the theoretical foundation for a new direction in linguistics, becoming the founder and “father” of modern linguistics.

The concept put forward by F. de Saussure was further developed in the works of many outstanding scientists: in Denmark by L. Hjelmslev, in the Czech Republic by N. Trubetzkoy, in the USA by L. Bloomfield, Z. Harris, and N. Chomsky. As for our country, structural linguistics began its development here in roughly the same period as in the West – at the turn of the 19th and 20th centuries – in the works of F. Fortunatov and I. Baudouin de Courtenay. It should be noted that I. Baudouin de Courtenay worked closely with F. de Saussure. If Saussure laid the theoretical foundation of structural linguistics, then Baudouin de Courtenay can be considered the person who laid the foundations for the practical application of the methods proposed by the Swiss scientist. It was he who defined linguistics as a science that uses statistical methods and functional dependencies, and separated it from philology. The first area in which mathematical methods were applied in linguistics was phonology – the study of the sound structure of language.

It should be noted that the postulates put forward by F. de Saussure were reflected in the problems of linguistics that became pressing in the middle of the 20th century. It was during this period that a clear tendency towards the mathematization of the science of language emerged. In almost all large countries, the rapid development of science and computer technology began, which in turn required ever newer linguistic foundations. The result of all this was the rapid convergence of the exact sciences and the humanities, as well as the active interaction of mathematics and linguistics, which found practical application in solving pressing scientific problems.

In the 1950s, at the intersection of such sciences as mathematics, linguistics, computer science, and artificial intelligence, a new scientific direction arose – computational linguistics (also known as machine linguistics or automatic text processing in natural language). The main stages of the development of this direction took place against the backdrop of the evolution of artificial intelligence methods. A powerful impetus for the development of computational linguistics was the creation of the first computers. However, with the advent of a new generation of computers and programming languages in the 1960s, a fundamentally new stage in the development of this science began. It should also be noted that the origins of computational linguistics go back to the works of the famous American linguist N. Chomsky on formalizing the structure of language. The results of his research, obtained at the intersection of linguistics and mathematics, formed the basis for the theory of formal languages and grammars (generative grammars), which is widely used to describe both natural and artificial languages, in particular programming languages. More precisely, this theory is a fully mathematical discipline. It can be considered one of the first achievements in such a direction of applied linguistics as mathematical linguistics.

The first experiments and first developments in computational linguistics relate to the creation of machine translation systems, as well as systems that model human language abilities. In the late 80s, with the advent and active development of the Internet, there was a rapid growth in the volume of text information available in electronic form. This has led to the fact that information retrieval technologies have moved to a qualitatively new stage of their development. The need arose for automatic processing of texts in natural language, and completely new tasks and technologies appeared. Scientists are faced with the problem of quickly processing a huge stream of unstructured data. In order to find a solution to this problem, great importance has been given to the development and application of statistical methods in the field of automatic text processing. It was with their help that it became possible to solve such problems as dividing texts into clusters united by a common theme, highlighting certain fragments in the text, etc. In addition, the use of methods of mathematical statistics and machine learning made it possible to solve the problems of speech recognition and the creation of search engines.

Scientists did not stop at the results achieved: they continued to set themselves new goals and objectives, develop new techniques and research methods. All this led to the fact that linguistics began to act as an applied science, combining a number of other sciences, the leading role among which belonged to mathematics with its variety of quantitative methods and the ability to use them for a deeper understanding of the phenomena being studied. This is how mathematical linguistics began its formation and development. At the moment, this is a fairly “young” science (it has existed for about fifty years), however, despite its very “young age”, it represents an already established field of scientific knowledge with many successful achievements.


COMPUTATIONAL LINGUISTICS, a direction in applied linguistics focused on the use of computer tools – programs, computer technologies for organizing and processing data – to model the functioning of language in certain conditions, situations, and problem areas, as well as the entire sphere of application of computer models of language in linguistics and related disciplines. Strictly speaking, only in the latter case are we talking about applied linguistics in the narrow sense, since computer modeling of language can also be considered a field of application of computer science and programming theory to solving problems in the science of language. In practice, however, computational linguistics includes almost everything related to the use of computers in linguistics.

Computational linguistics took shape as a distinct scientific field in the 1960s. The Russian term "komp'yuternaya lingvistika" is a translation of the English computational linguistics. Since the adjective computational can also be rendered in Russian as "vychislitel'naya," the term "computational (vychislitel'naya) linguistics" is also found in the literature, but in Russian science it has a narrower meaning, approaching the concept of "quantitative linguistics." The flow of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics is published quarterly in the United States. Much organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures (in particular, a European chapter). International conferences on computational linguistics – COLING – are held every two years. The corresponding issues are also widely represented at various conferences on artificial intelligence.

Toolkit for Computational Linguistics.

Computational linguistics as a special applied discipline is distinguished primarily by its toolkit, i.e., by the use of computer tools for processing language data. Since computer programs modeling particular aspects of the functioning of a language may use a wide variety of programming tools, it might seem that there is no need to speak of a common conceptual apparatus of computational linguistics. However, this is not so. There are general principles of computer modeling of thinking that are implemented in one way or another in any computer model. They are based on the theory of knowledge, which originally developed in the field of artificial intelligence and later became one of the branches of cognitive science. The most important conceptual categories of computational linguistics are such knowledge structures as "frames" (conceptual structures for the declarative representation of knowledge about a typified, thematically unified situation), "scenarios" (conceptual structures for the procedural representation of knowledge about a stereotypical situation or stereotypical behavior), and "plans" (knowledge structures that capture ideas about possible actions leading to the achievement of a certain goal). Closely related to the category of frame is the concept of "scene." The category of scene is mainly used in the computational linguistics literature to denote a conceptual structure for the declarative representation of situations and their parts that are actualized in a speech act and highlighted by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.).
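As an informal illustration of what a frame might look like as a data structure (a minimal sketch with invented slot names, not a representation taken from any particular system):

```python
# A minimal sketch of a frame as a knowledge structure: a named set of slots with
# default values that a concrete situation fills in.
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    slots: dict = field(default_factory=dict)   # slot -> filler (or None if unknown)

    def fill(self, **fillers):
        """Instantiate the frame for a concrete situation."""
        return Frame(self.name, {**self.slots, **fillers})

# A typified "lecture" situation with default slot fillers:
lecture = Frame("lecture", {"lecturer": None, "audience": "students", "place": "classroom"})
today = lecture.fill(lecturer="Prof. N.", place="room 101")
print(today.slots)   # {'lecturer': 'Prof. N.', 'audience': 'students', 'place': 'room 101'}
```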

A set of knowledge structures organized in a certain way forms the "world model" of the cognitive system and of its computer model. In artificial intelligence systems, the world model forms a special block, which, depending on the chosen architecture, may include general knowledge about the world (in the form of simple propositions such as "it is cold in winter" or in the form of production rules: "if it is raining outside, then you need to wear a raincoat or take an umbrella"), some specific facts ("The highest peak in the world is Everest"), as well as values and their hierarchies, sometimes separated into a special "axiological block."

Most elements of the conceptual apparatus of computational linguistics are homonymous: they simultaneously designate certain real entities of the human cognitive system and the ways of representing these entities used in their theoretical description and modeling. In other words, the elements of the conceptual apparatus of computational linguistics have ontological and instrumental aspects. For example, in the ontological aspect, the division into declarative and procedural knowledge corresponds to the different types of knowledge available to a person – so-called knowledge THAT (declarative; such as, for example, knowledge of the postal address of some person NN), on the one hand, and knowledge HOW (procedural; such as, for example, knowledge that allows one to find that person's apartment even without knowing its formal address), on the other. In the instrumental aspect, knowledge can be embodied in a set of descriptions, in a data set, on the one hand, and in an algorithm or instruction executed by a computer or some other model of a cognitive system, on the other.

Directions of computational linguistics.

The field of CL is very diverse and includes such areas as computer modeling of communication, plot structure modeling, hypertext technologies for text presentation, machine translation, and computer lexicography. In a narrow sense, the problems of CL are often associated with an interdisciplinary applied area with the somewhat unfortunate name “natural language processing” (translation of the English term Natural Language Processing). It arose in the late 1960s and developed within the scientific and technological discipline of “artificial intelligence”. In its internal form, the phrase “natural language processing” covers all areas in which computers are used to process language data. Meanwhile, a narrower understanding of this term has taken hold in practice - the development of methods, technologies and specific systems that ensure communication between a person and a computer in natural or limited natural language.

The rapid development of the field of "natural language processing" occurred in the 1970s, which was associated with an unexpected exponential growth in the number of end users of computers. Since it is impossible to teach all users programming languages and technologies, the problem of organizing their interaction with computer programs arose. The solution to this communication problem followed two main paths. In the first case, attempts were made to adapt programming languages and operating systems to the end user. As a result, high-level languages such as Visual Basic appeared, as well as convenient operating systems built in the conceptual space of metaphors familiar to humans – DESK, LIBRARY. The second path was to develop systems that would allow interaction with a computer in a specific problem area in natural language or in some limited version of it.

The architecture of natural language processing systems in the general case includes a block for analyzing the user's speech message, a block for interpreting the message, a block for generating the meaning of the response, and a block for synthesizing the surface structure of the statement. A special part of the system is the dialogue component, which records strategies for conducting dialogue, conditions for using these strategies, and ways to overcome possible communication failures (failures in the communication process).
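As an illustration only, the sketch below shows this four-block architecture in a deliberately toy form; the block names, rules and example dialogue are our own assumptions and do not describe any particular system.

```python
# A toy sketch of the classic NL dialogue system architecture:
# analysis -> interpretation -> response-meaning generation -> surface synthesis.
# All rules and names here are illustrative assumptions, not a real system.

def analyze(utterance: str) -> list[str]:
    """Analysis block: reduce the user's message to normalized tokens."""
    return utterance.lower().rstrip("?!.").split()

def interpret(tokens: list[str]) -> dict:
    """Interpretation block: map tokens to a crude formal representation."""
    if "weather" in tokens:
        return {"act": "question", "topic": "weather"}
    return {"act": "unknown", "topic": None}

def plan_response(meaning: dict) -> dict:
    """Response-meaning block: decide what to say, still in formal terms."""
    if meaning["act"] == "question" and meaning["topic"] == "weather":
        return {"act": "inform", "topic": "weather", "value": "cold"}
    return {"act": "clarify"}

def synthesize(response: dict) -> str:
    """Surface-synthesis block: turn the formal response into a sentence."""
    if response["act"] == "inform":
        return f"It is {response['value']} today."
    return "Could you rephrase your question?"

def dialogue_turn(utterance: str) -> str:
    return synthesize(plan_response(interpret(analyze(utterance))))

print(dialogue_turn("What is the weather like?"))  # -> "It is cold today."
```

In a real system each block is, of course, far more complex, and the dialogue component additionally tracks the strategy of the conversation across turns.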

Among computer natural language processing systems, question-answer systems, interactive problem-solving systems, and connected text processing systems are usually distinguished. Initially, question-answer systems began to be developed as a reaction to the poor quality of query encoding when searching for information in information retrieval systems. Since the problem area of ​​such systems was very limited, this somewhat simplified the algorithms for translating queries into a representation in a formal language and the reverse procedure for converting a formal representation into statements in a natural language. Among the domestic developments, programs of this type include the POET system, created by a team of researchers under the leadership of E.V. Popov. The system processes requests in Russian (with minor restrictions) and synthesizes the answer. The program flowchart involves going through all stages of analysis (morphological, syntactic and semantic) and the corresponding stages of synthesis.

Interactive problem-solving systems, unlike systems of the previous type, play an active role in communication, since their task is to obtain a solution to a problem on the basis of the knowledge embedded in the system and the information that can be obtained from the user. The system contains knowledge structures that record typical sequences of actions for solving problems in a given problem area, as well as information about the necessary resources. When the user asks a question or sets a specific task, the corresponding script is activated. If some components of the script or some resources are missing, the system initiates communication. This is how, for example, the SNUKA system works, which solves problems of planning military operations.
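A minimal sketch of this script-driven behaviour, with an invented script and invented slot names, might look as follows; it is an illustration of the idea rather than a reconstruction of SNUKA.

```python
# Illustrative sketch of script-driven dialogue: a script records the typical
# steps and required resources; missing items trigger questions to the user.
# The script content and slot names are invented for the example.

script = {
    "task": "plan a trip",
    "required": ["destination", "departure_date", "budget"],
}

def run_script(known: dict) -> None:
    for slot in script["required"]:
        if slot not in known:
            # The system takes the initiative and asks for the missing resource.
            known[slot] = input(f"Please specify the {slot.replace('_', ' ')}: ")
    print(f"Planning '{script['task']}' with parameters: {known}")

run_script({"destination": "Moscow"})  # the system will ask for the missing slots
```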

Systems for processing connected texts are quite diverse in structure. Their common feature can be considered the widespread use of knowledge representation technologies. The functions of systems of this kind are to understand the text and to answer questions about its content. Understanding is regarded not as a universal category but as a process of extracting information from a text, determined by a specific communicative intention. In other words, the text is "read" only on the assumption that the potential user wants to learn precisely certain things about it. Thus, systems for processing connected texts turn out to be not universal but problem-oriented. Typical examples of systems of this type are the RESEARCHER and TAILOR systems, which form a single software package allowing the user to obtain information from patent abstracts describing complex physical objects.

The most important area of computational linguistics is the development of information retrieval systems (IRS). The latter arose in the late 1950s and early 1960s as a response to the sharp increase in the volume of scientific and technical information. Based on the type of information stored and processed, as well as on the search features, information retrieval systems are divided into two large groups: documentary and factual. Documentary information retrieval systems store the texts of documents or their descriptions (abstracts, bibliographic cards, etc.). Factual IRS deal with descriptions of specific facts, and not necessarily in text form: these can be tables, formulas and other types of data presentation. There are also mixed systems, including both documents and factual information. Currently, factual information systems are built on the basis of database (DB) technologies.

To ensure retrieval in an IRS, special information retrieval languages are created, which are based on information retrieval thesauri. An information retrieval language is a formal language designed to describe certain aspects of the content of the documents stored in the IRS and of the query. The procedure for describing a document in an information retrieval language is called indexing. As a result of indexing, each document is assigned its formal description in the information retrieval language: the search image of the document. The query is indexed in a similar way; it is assigned a search image of the query and a search prescription. Information retrieval algorithms are based on comparing the search prescription with the search images of documents. The criterion for issuing a document in response to a request may be a complete or partial match of the search image of the document with the search prescription. In some cases the user has the opportunity to formulate the issuing criterion himself, which is determined by his information need.

Automated information retrieval systems often use descriptor information retrieval languages. The subject of a document is described by a set of descriptors. Descriptors are words and terms denoting simple, fairly elementary categories and concepts of the problem area. As many descriptors are entered into the search image of the document as there are different topics covered in the document. The number of descriptors is not limited, which allows the document to be described in a multidimensional feature space. Often restrictions are imposed on the compatibility of descriptors in a descriptor information retrieval language; in this case one can say that the information retrieval language has a syntax.
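To make the scheme concrete, the following toy sketch models a descriptor IRS: documents and the query are reduced to search images (sets of descriptors), and retrieval compares them. The vocabulary, documents and matching criterion are invented for the illustration.

```python
# A toy model of a descriptor IRS: documents and the query are indexed by
# sets of descriptors (their "search images"), and retrieval compares the
# search prescription with the document images. Descriptors are invented.

documents = {
    "doc1": "machine translation of scientific texts",
    "doc2": "thesaurus construction for information retrieval",
    "doc3": "speech synthesis and analysis",
}
descriptor_vocabulary = {"translation", "thesaurus", "retrieval", "speech", "synthesis"}

def index(text: str) -> set[str]:
    """Indexing: keep only words that are descriptors of the IR language."""
    return {w for w in text.lower().split() if w in descriptor_vocabulary}

search_images = {name: index(text) for name, text in documents.items()}

def search(query: str, require_all: bool = False) -> list[str]:
    """Compare the search prescription (query image) with document images."""
    query_image = index(query)
    hits = []
    for name, image in search_images.items():
        match = query_image <= image if require_all else bool(query_image & image)
        if match:
            hits.append(name)
    return hits

print(search("information retrieval with a thesaurus"))        # partial match
print(search("information retrieval with a thesaurus", True))  # full match only
```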

One of the first systems that worked with a descriptor language was the American UNITERM system, created by M. Taube. Document keywords, called uniterms, functioned as descriptors in this system. The peculiarity of this IRS is that the dictionary of the information language was not specified in advance but arose in the process of indexing documents and queries. The development of modern information retrieval systems is associated with IRS of the non-thesaurus type. Such systems work with the user in a restricted natural language, and the search is carried out over the texts of document abstracts, over their bibliographic descriptions, and often over the documents themselves. For indexing in IRS of the non-thesaurus type, words and phrases of natural language are used.

To a certain extent, the field of computational linguistics also includes work on creating hypertext systems, considered as a special way of organizing text and even as a fundamentally new type of text, contrasted in many of its properties with ordinary text formed in the Gutenberg tradition of printing. The idea of hypertext is associated with the name of Vannevar Bush, science advisor to President F. Roosevelt. Bush theoretically substantiated the project of the Memex technical system, which would allow the user to connect texts and their fragments by various types of links, mainly by associative relations. The absence of computer technology made the project difficult to realize: the mechanical system turned out to be too complex for practical implementation.

Bush's idea was reborn in the 1960s in T. Nelson's Xanadu system, which already presupposed the use of computer technology. Xanadu allowed the user to read the set of texts entered into the system in different ways, in different sequences; the software made it possible both to remember the sequence of viewed texts and to select almost any of them at any moment. A set of texts with the relations connecting them (a system of transitions) was called hypertext by Nelson. Many researchers view the creation of hypertext as the beginning of a new information era, opposed to the era of printing. The linearity of writing, which outwardly reflects the linearity of speech, turns out to be a fundamental category limiting human thinking and the understanding of text. The world of meaning is nonlinear, so the compression of semantic information in a linear speech segment requires the use of special "communicative packages": division into theme and rheme, division of the content of an utterance into explicit (statement, proposition, focus) and implicit (presupposition, consequence, discourse implicature) layers. Abandoning the linearity of the text both in the process of its presentation to the reader (i.e. during reading and understanding) and in the process of synthesis would, according to theorists, contribute to the "liberation" of thinking and even to the emergence of its new forms.

In a computer system, hypertext is presented in the form of a graph, the nodes of which contain traditional texts or their fragments, images, tables, videos, etc. The nodes are connected by a variety of relationships, the types of which are specified by hypertext software developers or by the reader himself. Relationships define the potential possibilities of movement, or navigation through hypertext. Relationships can be unidirectional or bidirectional. Accordingly, bidirectional arrows allow the user to move in both directions, while unidirectional arrows allow the user to move only in one direction. The chain of nodes through which the reader passes when viewing the components of the text forms a path, or route.
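A minimal sketch of such a graph representation, with invented nodes and links, could look like this:

```python
# A minimal hypertext model: nodes hold text fragments, edges define the
# permitted transitions (navigation). Node names and links are illustrative.

nodes = {
    "intro": "What hypertext is ...",
    "history": "Memex and Xanadu ...",
    "types": "Hierarchical and network hypertexts ...",
}

links: dict[str, set[str]] = {name: set() for name in nodes}

def add_link(a: str, b: str, bidirectional: bool = True) -> None:
    links[a].add(b)
    if bidirectional:
        links[b].add(a)

add_link("intro", "history")                       # two-way relation
add_link("intro", "types", bidirectional=False)    # one-way relation

# A route is the chain of nodes visited by the reader.
route = ["intro", "types"]
assert all(b in links[a] for a, b in zip(route, route[1:]))
print("Route is legal:", route)
```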

Computer implementations of hypertext can be hierarchical or networked. The hierarchical, tree-like structure of hypertext significantly limits the possibilities of transition between its components. In such a hypertext, the relations between components resemble the structure of a thesaurus based on genus-species relations. Network hypertext allows the use of various types of relations between components, not limited to genus-species relations. According to their mode of existence, static and dynamic hypertexts are distinguished. A static hypertext does not change during operation; the user can record his comments in it, but they do not change its substance. For a dynamic hypertext, change is a normal form of existence; such hypertexts typically operate where it is necessary to constantly analyze a flow of information, i.e. in information services of various kinds. An example is the Arizona Information System (AAIS), a hypertext that is updated with 300–500 abstracts every month.

The relations between hypertext elements can either be fixed in advance by the creators or be generated whenever a user accesses the hypertext. In the first case we speak of hypertexts with a rigid structure, in the second of hypertexts with a soft structure. The rigid structure is technologically quite straightforward. The technology for organizing a soft structure should be based on a semantic analysis of the proximity of documents (or other sources of information) to each other, which is a non-trivial task of computational linguistics. Nowadays, soft-structure technologies based on keywords are widely used: the transition from one node to another in the hypertext network is carried out as a result of a search for keywords, and since the set of keywords may be different each time, the structure of the hypertext changes each time as well.
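The following sketch illustrates a soft structure of the simplest kind, in which links are recomputed from keyword overlap on every access; the texts and the stop-word list are invented.

```python
# Sketch of a "soft" hypertext structure: links are not stored but computed
# on demand from keyword overlap between documents. Texts are invented.

texts = {
    "d1": "hypertext navigation and links",
    "d2": "keyword based links between documents",
    "d3": "morphological analysis of words",
}

def keywords(text: str) -> set[str]:
    stopwords = {"and", "of", "the", "based"}
    return {w for w in text.lower().split() if w not in stopwords}

def neighbours(doc_id: str) -> list[str]:
    """Generate transitions to the documents sharing at least one keyword."""
    source = keywords(texts[doc_id])
    return [other for other, text in texts.items()
            if other != doc_id and source & keywords(text)]

print(neighbours("d1"))  # the link set is recomputed on every access
```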

The technology for building hypertext systems does not distinguish between textual and non-textual information. Meanwhile, the inclusion of visual and audio information (videos, pictures, photographs, sound recordings, etc.) requires a significant change in the user interface and more powerful software and hardware support. Such systems are called hypermedia, or multimedia. The visual appeal of multimedia systems has predetermined their widespread use in education and in the creation of computer versions of encyclopedias. There are, for example, beautifully produced CD-ROMs with multimedia systems based on the children's encyclopedias published by Dorling Kindersley.

Within the framework of computer lexicography, computer technologies for compiling and operating dictionaries are being developed. Special programs - databases, computer file cabinets, word processing programs - allow you to automatically generate dictionary entries, store dictionary information and process it. Many different computer lexicographic programs are divided into two large groups: programs for supporting lexicographic works and automatic dictionaries of various types, including lexicographic databases. An automatic dictionary is a dictionary in a special machine format intended for use on a computer by a user or a computer word processing program. In other words, there is a distinction between automatic dictionaries for the human end user and automatic dictionaries for word processing programs. Automatic dictionaries intended for the end user differ significantly in interface and structure of the dictionary entry from automatic dictionaries included in machine translation systems, automatic abstracting systems, information retrieval systems, etc. Most often they are computer versions of well-known conventional dictionaries. On the software market there are computer analogues of explanatory dictionaries of the English language (the automatic Webster, the automatic explanatory dictionary of the English language published by Collins, the automatic version of the New Large English-Russian Dictionary edited by Yu.D. Apresyan and E.M. Mednikova), there is also a computer version of Ozhegov's dictionary. Automatic dictionaries for word processing programs can be called automatic dictionaries in the strict sense. They are generally not intended for the average user. The features of their structure and the scope of vocabulary material are determined by the programs that interact with them.

Computer modeling of plot structure is another promising area of computational linguistics. The study of plot structure relates to problems of structural literary criticism (in a broad sense), semiotics and cultural studies. Available computer programs for plot modeling are based on three basic formalisms of plot representation: the morphological and syntactic approaches to plot representation, as well as the cognitive approach. Ideas about the morphological structure of plot go back to the famous works of V.Ya. Propp on the Russian fairy tale. Propp noticed that, despite the abundance of characters and events in fairy tales, the number of character functions is limited, and he proposed an apparatus for describing these functions. Propp's ideas formed the basis of the TALE computer program, which simulates the generation of a fairy-tale plot. The algorithm of the TALE program is based on the sequence of the characters' functions in the fairy tale. In effect, Propp's functions defined a set of typified situations, ordered on the basis of an analysis of empirical material. The possibilities of linking various situations in the generation rules were determined by the typical sequence of functions, in the form in which it can be established from the texts of fairy tales. In the program, typical sequences of functions were described as typical scenarios of encounters between characters.
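The toy generator below conveys the spirit of this approach: the plot is produced by walking a typical sequence of character functions. The abridged function inventory and the textual realizations are invented and do not reproduce the TALE program.

```python
# A toy plot generator in the spirit of Propp's character functions:
# the plot is produced by walking a fixed, typical sequence of functions
# and filling each with concrete characters. The inventory is abridged.
import random

function_sequence = ["absentation", "interdiction", "violation",
                     "villainy", "struggle", "victory", "return"]

realizations = {
    "absentation": "{hero} leaves home.",
    "interdiction": "{hero} is warned not to enter the forest.",
    "violation": "{hero} breaks the interdiction.",
    "villainy": "{villain} kidnaps {victim}.",
    "struggle": "{hero} fights {villain}.",
    "victory": "{villain} is defeated.",
    "return": "{hero} returns and is rewarded.",
}

characters = {"hero": random.choice(["Ivan", "Vasilisa"]),
              "villain": "Koschei", "victim": "the tsar's daughter"}

for function in function_sequence:
    print(realizations[function].format(**characters))
```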

The theoretical basis of the syntactic approach to the plot of a text was provided by "plot grammars", or "story grammars". They appeared in the mid-1970s as a result of transferring the ideas of N. Chomsky's generative grammar to the description of the macrostructure of text. Whereas the most important components of syntactic structure in a generative grammar were verb and noun phrases, in most plot grammars the exposition (setting), the event and the episode were singled out as basic. The theory of plot grammars has widely discussed minimality conditions, that is, the restrictions that determine the status of a sequence of plot elements as a normal plot. It turned out, however, that this cannot be done by purely linguistic methods: many restrictions are sociocultural in nature. Plot grammars, while differing significantly in the set of categories of the generation tree, allowed only a very limited set of rules for modifying the narrative structure.

In the early 1980s, one of R. Schank's students, W. Lehnert, as part of her work on creating a computer plot generator, proposed an original formalism of affective plot units, which turned out to be a powerful means of representing plot structure. Although it was originally developed for an artificial intelligence system, this formalism has also been used in purely theoretical studies. The essence of Lehnert's approach is that the plot is described as a successive change in the cognitive-emotional states of the characters. Thus, the focus of Lehnert's formalism is not on the external components of the plot (exposition, event, episode, moral) but on its content characteristics. In this respect, Lehnert's formalism is in part a return to Propp's ideas.

The competence of computer linguistics also includes machine translation, which is currently experiencing a rebirth.

Literature:

Popov E.V. Communication with a computer in natural language. M., 1982
Sadur V.G. Speech communication with electronic computers and problems of their development. – In the book: Speech communication: problems and prospects. M., 1983
Baranov A.N. Categories of artificial intelligence in linguistic semantics. Frames and scripts. M., 1987
Kobozeva I.M., Laufer N.I., Saburova I.G. Modeling communication in human-machine systems. – Linguistic support of information systems. M., 1987
Olker H.R. Fairy tales, tragedies and ways of presenting world history. – In the book: Language and modeling of social interaction. M., 1987
Gorodetsky B.Yu. Computational linguistics: modeling language communication
McQueen K. Discourse strategies for natural language text synthesis. – New in foreign linguistics. Vol. XXIV, Computational Linguistics. M., 1989
Popov E.V., Preobrazhensky A.B. Features of the implementation of NL systems
Preobrazhensky A.B. State of development of modern NL systems. - Artificial intelligence. Book 1, Communication systems and expert systems. M., 1990
Subbotin M.M. Hypertext. A new form of written communication. – VINITI, Ser. Computer Science, 1994, vol. 18
Baranov A.N. Introduction to Applied Linguistics. M., 2000



COURSE WORK

in the discipline "Informatics"

on the topic: “Computational linguistics”


INTRODUCTION

1. The place and role of computational linguistics in linguistic research

2. Modern interfaces for computational linguistics

CONCLUSION

LITERATURE


Introduction

Automated information technologies play an important role in the life of modern society, and their importance continuously increases over time. Yet the development of information technologies is very uneven: while the modern level of computing and communication hardware is impressive, the successes in the semantic processing of information are much more modest. These successes depend, first of all, on achievements in the study of the processes of human thinking and of verbal communication between people, and on the ability to model these processes on a computer.

When it comes to creating promising information technologies, the problems of automatic processing of textual information presented in natural languages ​​come to the fore. This is determined by the fact that a person’s thinking is closely connected with his language. Moreover, natural language is a tool for thinking. It is also a universal means of communication between people - a means of perception, accumulation, storage, processing and transmission of information. The science of computer linguistics deals with the problems of using natural language in automatic information processing systems. This science arose relatively recently - at the turn of the fifties and sixties of the last century. Over the past half century, significant scientific and practical results have been obtained in the field of computer linguistics: systems for machine translation of texts from one natural language to another, systems for automated information retrieval in texts, systems for automatic analysis and synthesis of oral speech, and many others have been created. This work is devoted to the construction of an optimal computer interface using computer linguistics when conducting linguistic research.


1. The place and role of computational linguistics in linguistic research

In the modern world, computational linguistics is increasingly being used to conduct various linguistic studies.

Computational linguistics is a field of knowledge associated with solving problems of automatic processing of information presented in natural language. The central scientific problems of computer linguistics are the problem of modeling the process of understanding the meaning of texts (transition from text to a formalized representation of its meaning) and the problem of speech synthesis (transition from a formalized representation of meaning to texts in natural language). These problems arise when solving a number of applied problems and, in particular, problems of automatic detection and correction of errors when entering texts into a computer, automatic analysis and synthesis of oral speech, automatic translation of texts from one language to another, communication with a computer in natural language, automatic classification and indexing of text documents, their automatic abstracting, searching for documents in full-text databases.

Linguistic tools created and used in computational linguistics can be divided into two parts: declarative and procedural. The declarative part includes dictionaries of units of language and speech, texts and various kinds of grammar tables; the procedural part includes the means of manipulating units of language and speech, texts and grammar tables. The computer interface belongs to the procedural part of computational linguistics.
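The division can be illustrated by the following toy sketch, in which the dictionary is the declarative component and the analysis procedure the procedural one; the entries are invented and heavily simplified.

```python
# Illustration of the declarative / procedural division of linguistic tools:
# the dictionary is declarative data, the analyzer is a procedure over it.
# The entries are invented and heavily simplified.

lexicon = {           # declarative part: a toy dictionary of word stems
    "книг": {"pos": "noun", "lemma": "книга"},
    "чита": {"pos": "verb", "lemma": "читать"},
}

def analyze(wordform: str) -> dict | None:
    """Procedural part: find the longest stem that matches the word form."""
    for length in range(len(wordform), 0, -1):
        entry = lexicon.get(wordform[:length])
        if entry:
            return {"form": wordform, **entry}
    return None

print(analyze("книгами"))  # {'form': 'книгами', 'pos': 'noun', 'lemma': 'книга'}
```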

Success in solving applied problems of computer linguistics depends, first of all, on the completeness and accuracy of the representation of declarative means in computer memory and on the quality of procedural means. To date, the required level of solving these problems has not yet been achieved, although work in the field of computational linguistics is being carried out in all developed countries of the world (Russia, USA, England, France, Germany, Japan, etc.).

Nevertheless, serious scientific and practical achievements in the field of computational linguistics can be noted. Thus, in a number of countries (Russia, USA, Japan, etc.) experimental and industrial systems for machine translation of texts from one language to another have been built, a number of experimental systems for communicating with computers in natural language have been built, work is underway to create terminological data banks, thesauruses, bilingual and multilingual machine dictionaries (Russia, USA, Germany, France, etc.), systems for automatic analysis and synthesis of oral speech are being built (Russia, USA, Japan, etc.), research is being conducted in the field of constructing natural language models.

An important methodological problem of applied computational linguistics is the correct assessment of the necessary relationship between the declarative and procedural components of automatic text information processing systems. What should be preferred: powerful computational procedures based on relatively small vocabulary systems with rich grammatical and semantic information, or a powerful declarative component with relatively simple computer interfaces? Most scientists believe that the second way is preferable. It will lead to the achievement of practical goals faster, since there will be fewer dead ends and difficult obstacles to overcome, and here it will be possible to use computers on a wider scale to automate research and development.

The need to concentrate efforts, first of all, on the development of the declarative component of automatic text processing systems is confirmed by half a century of experience in the development of computational linguistics: despite the undeniable achievements of this science, the emphasis on algorithmic procedures has not brought the expected results, and there has even been some disappointment in the capabilities of procedural means.

In light of the above, it seems promising to develop such a path of development of computer linguistics, when the main efforts will be aimed at creating powerful dictionaries of language and speech units, studying their semantic-syntactic structure and creating basic procedures for morphological, semantic-syntactic and conceptual analysis and synthesis of texts. This will allow us to solve a wide range of applied problems in the future.

Computer linguistics faces, first of all, the tasks of linguistic support for the processes of collecting, accumulating, processing and retrieving information. The most important of them are:

1. Automation of the compilation and linguistic processing of machine dictionaries;

2. Automation of the processes of detecting and correcting errors when entering texts into a computer;

3. Automatic indexing of documents and information requests;

4. Automatic classification and abstracting of documents;

5. Linguistic support for information retrieval processes in monolingual and multilingual databases;

6. Machine translation of texts from one natural language to another;

7. Construction of linguistic processors that ensure user communication with automated intelligent information systems (in particular, expert systems) in natural language, or in a language close to natural;

8. Extracting factual information from informal texts.

Let us dwell in detail on the problems most relevant to the topic of research.

In the practical activities of information centers, there is a need to solve the problem of automated detection and correction of errors in texts when they are entered into a computer. This complex task can be conditionally divided into three subtasks: orthographic, syntactic and semantic control of texts. The first of them can be solved using a morphological analysis procedure based on a fairly powerful reference machine dictionary of word stems. In the process of spelling control, the words of the text are subjected to morphological analysis, and if their stems are identified with stems of the reference dictionary, they are considered correct; if not, they are presented, accompanied by a micro-context, to a human for inspection. The human detects and corrects the distorted words, and the corresponding software system enters these corrections into the text.
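A minimal sketch of such spelling control, assuming a toy stem dictionary and a toy list of endings, is given below.

```python
# Sketch of spelling control by morphological analysis against a reference
# stem dictionary: a word is accepted if some stem + ending decomposition
# is found, otherwise it is shown to a human with a small context window.
# The dictionaries are, of course, illustrative.

stems = {"translat", "comput", "linguist", "data"}
endings = {"", "e", "es", "ed", "ing", "er", "ers", "ic", "ics"}

def is_known(word: str) -> bool:
    w = word.lower()
    return any(w == stem + end for stem in stems for end in endings)

def spell_check(words: list[str], window: int = 2):
    for i, word in enumerate(words):
        if not is_known(word):
            context = " ".join(words[max(0, i - window): i + window + 1])
            yield word, context   # presented to the operator for correction

text = "computers translaet linguistic data".split()
print(list(spell_check(text)))   # flags only the distorted word "translaet"
```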

The task of syntactic control of texts in order to detect errors in them is much more difficult than the task of spelling control, firstly because it includes spelling control as an obligatory component, and secondly because the problem of syntactic analysis of non-formalized texts has not yet been fully solved. However, partial syntactic control of texts is quite possible. Here one can proceed in two ways: either compile fairly representative machine dictionaries of reference syntactic structures and compare the syntactic structures of the analyzed text with them, or develop a complex system of rules for checking the grammatical agreement of text elements. The first path seems to us more promising, although it, of course, does not exclude the use of elements of the second. The syntactic structure of texts should be described in terms of grammatical classes of words (more precisely, as sequences of sets of grammatical information about words).
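The first path can be illustrated by the following sketch, in which a sentence is mapped to a sequence of grammatical classes and compared with a small set of reference structures; the tagset and the patterns are invented.

```python
# Sketch of partial syntactic control by comparison with reference structures:
# a sentence is mapped to its sequence of grammatical classes and checked
# against a list of admissible patterns. Tagset and patterns are invented.

pos_of = {"the": "DET", "cat": "NOUN", "dog": "NOUN",
          "sleeps": "VERB", "barks": "VERB", "loudly": "ADV"}

reference_patterns = {
    ("DET", "NOUN", "VERB"),
    ("DET", "NOUN", "VERB", "ADV"),
}

def syntactic_check(sentence: str) -> bool:
    tags = tuple(pos_of.get(w, "UNK") for w in sentence.lower().split())
    return tags in reference_patterns

print(syntactic_check("The dog barks loudly"))   # True
print(syntactic_check("Loudly the dog"))         # False: no reference structure
```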

The task of semantic control of texts in order to detect semantic errors in them should be classified as a class of artificial intelligence tasks. It can be solved in full only on the basis of modeling the processes of human thinking. In this case, it will apparently be necessary to create powerful encyclopedic knowledge bases and software tools for knowledge manipulation. Nevertheless, for limited subject areas and for formalized information, this task is completely solvable. It should be posed and solved as a problem of semantic-syntactic control of texts.

The problem of automating the indexing of documents and queries is traditional for automated text information retrieval systems. At first, indexing was understood as the process of assigning classification indices to documents and queries that reflected their thematic content. Subsequently, this concept was transformed and the term “indexing” began to refer to the process of translating descriptions of documents and queries from natural language into a formalized one, in particular, into the language of “search images”. Search images of documents began, as a rule, to be drawn up in the form of lists of keywords and phrases reflecting their thematic content, and search images of queries - in the form of logical structures in which keywords and phrases were connected to each other by logical and syntactic operators.

It is convenient to index documents automatically on the basis of the texts of their abstracts (if available), since abstracts reflect the main content of documents in a concentrated form. Indexing can be carried out with or without thesaurus control. In the first case, keywords and phrases of the reference machine dictionary are searched for in the title of the document and in its abstract, and only those that are found in the dictionary are included in the search image of the document. In the second case, keywords and phrases are extracted from the text and included in the search image regardless of whether they belong to any reference dictionary. A third option has also been implemented, in which, along with terms from the machine thesaurus, the search image of the document also includes terms extracted from the title and from the first sentence of the document's abstract. Experiments have shown that search images compiled automatically from titles and abstracts provide greater search completeness than search images compiled manually. This is explained by the fact that an automatic indexing system reflects various aspects of the content of documents more fully than a manual one.
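The difference between the two indexing modes can be illustrated as follows; the thesaurus, the stop-word list and the document are invented.

```python
# Sketch of automatic indexing of a document from its title and abstract:
# with thesaurus control only terms of the reference dictionary enter the
# search image; without control all content words do. Data are invented.

thesaurus = {"machine translation", "thesaurus", "indexing", "retrieval"}
stopwords = {"a", "an", "the", "of", "for", "and", "is", "this"}

def free_indexing(text: str) -> set[str]:
    return {w for w in text.lower().replace(".", "").split() if w not in stopwords}

def controlled_indexing(text: str) -> set[str]:
    low = text.lower()
    return {term for term in thesaurus if term in low}

title_and_abstract = "Automatic indexing for document retrieval. This paper ..."
print(free_indexing(title_and_abstract))        # free search image of the document
print(controlled_indexing(title_and_abstract))  # thesaurus-controlled search image
```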

Automatic indexing of queries poses approximately the same problems as automatic indexing of documents. Here, too, one has to extract keywords and phrases from the text and normalize the words included in the query. Logical connections between keywords and phrases, as well as contextual operators, can be entered manually or by an automated procedure. An important element of the automatic indexing of a query is the expansion of its constituent keywords and phrases with their synonyms and hyponyms (and sometimes also hypernyms and other terms associated with the original query terms). This can be done automatically or interactively using a machine thesaurus.
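A sketch of such thesaurus-based query expansion, with an invented thesaurus, is shown below.

```python
# Sketch of query indexing with thesaurus expansion: each query term is
# supplemented with its synonyms and hyponyms. The thesaurus is invented.

thesaurus = {
    "dictionary": {"synonyms": {"lexicon"}, "hyponyms": {"thesaurus", "glossary"}},
    "translation": {"synonyms": {"rendering"}, "hyponyms": {"machine translation"}},
}

def expand_query(terms: list[str]) -> dict[str, set[str]]:
    expanded = {}
    for term in terms:
        entry = thesaurus.get(term, {})
        expanded[term] = {term} | entry.get("synonyms", set()) | entry.get("hyponyms", set())
    return expanded

# Each group is typically joined by OR, and the groups by AND in the search image.
print(expand_query(["dictionary", "translation"]))
```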

We have already partially considered the problem of automating the search for documentary information in connection with the task of automatic indexing. The most promising here is to search for documents using their full texts, since the use of all kinds of substitutes for this purpose (bibliographic descriptions, search images of documents and the texts of their abstracts) leads to loss of information during the search. The greatest losses occur when bibliographic descriptions are used as substitutes for primary documents, and the smallest losses occur when abstracts are used.

Important characteristics of the quality of information retrieval are its completeness and accuracy. The completeness of the search can be ensured by taking maximum account of the paradigmatic connections between units of language and speech (words and phrases), and accuracy - by taking into account their syntagmatic connections. There is an opinion that the completeness and accuracy of the search are inversely related: measures to improve one of these characteristics lead to a deterioration of the other. But this is only true for fixed search logic. If this logic is improved, then both characteristics can be improved simultaneously.
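In the standard terminology of information retrieval these two characteristics are known as recall and precision; under the usual definitions they can be computed as in the following sketch.

```python
# Completeness (recall) and accuracy (precision) of a search under the
# standard definitions: recall = relevant retrieved / all relevant,
# precision = relevant retrieved / all retrieved.

def recall_and_precision(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
print(recall_and_precision(retrieved, relevant))  # (0.666..., 0.5)
```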

It is advisable to organize the search for information in full-text databases as a process of interactive communication between the user and the information retrieval system (IRS), in which the user sequentially views text fragments (paragraphs) that satisfy the logical conditions of the query and selects those that are of interest to him. Both full texts of documents and any of their fragments can be returned as the final search results.

As can be seen from the preceding discussion, automatic information retrieval has to overcome the language barrier that arises between the user and the IRS due to the variety of forms in which the same meaning can be expressed in texts. This barrier becomes even more significant if the search has to be carried out in multilingual databases. A radical solution here could be machine translation of document texts from one language to another, performed either in advance, before loading the documents into the search engine, or in the process of searching. In the latter case, the user's query must be translated into the language of the document array in which the search is conducted, and the search results must be translated back into the language of the query. Search engines of this kind already operate on the Internet. VINITI RAS has also built the Cyrillic Browser system, which allows searching Russian-language texts with queries in English, with the search results also presented in the user's language.
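The scheme of searching through query translation can be sketched as follows; the translate() function stands in for a real machine translation component, and the toy dictionary and document collection are invented.

```python
# Sketch of cross-language retrieval: the query is translated into the
# language of the document collection, the search is run, and the results
# are translated back into the query language. The translate() function
# stands in for a real MT component and is purely illustrative.

def translate(text: str, source: str, target: str) -> str:
    toy_dictionary = {("en", "ru"): {"computational linguistics": "компьютерная лингвистика"},
                      ("ru", "en"): {"компьютерная лингвистика": "computational linguistics"}}
    return toy_dictionary[(source, target)].get(text, text)

def cross_language_search(query_en: str, russian_collection: list[str]) -> list[str]:
    query_ru = translate(query_en, "en", "ru")
    hits = [doc for doc in russian_collection if query_ru in doc.lower()]
    return [translate(hit, "ru", "en") for hit in hits]

collection = ["компьютерная лингвистика и перевод", "синтез речи"]
print(cross_language_search("computational linguistics", collection))
```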

An important and promising task of computer linguistics is the construction of linguistic processors that ensure user communication with intelligent automated information systems (in particular, expert systems) in natural language or in a language close to natural. Since in modern intelligent systems information is stored in a formalized form, linguistic processors, acting as intermediaries between a person and a computer, must solve the following main tasks: 1) the task of transitioning from the texts of input information requests and messages in natural language to representing their meaning in a formalized language (when entering information into a computer); 2) the task of transition from a formalized representation of the meaning of output messages to its representation in natural language (when issuing information to a person). The first task must be solved by morphological, syntactic and conceptual analysis of input queries and messages, the second - by conceptual, syntactic and morphological synthesis of output messages.

Conceptual analysis of information queries and messages consists of identifying their conceptual structure (the boundaries of the names of concepts and the relations between concepts in the text) and translating this structure into a formalized language. It is carried out after morphological and syntactic analysis of queries and messages. Conceptual synthesis of messages consists of the transition from the representation of the elements of their structure in a formalized language to a verbal representation. After this, the messages are given the necessary syntactic and morphological form.

For machine translation of texts from one natural language to another, it is necessary to have dictionaries of translation correspondence between the names of concepts. Knowledge about such translation correspondences was accumulated by many generations of people and was compiled in the form of special publications - bilingual or multilingual dictionaries. For specialists who have some knowledge of foreign languages, these dictionaries served as valuable aids in translating texts.

In traditional bilingual and multilingual general-purpose dictionaries, translation equivalents were indicated primarily for individual words, and for phrases - much less often. Indication of translation equivalents for phrases was more typical for special terminological dictionaries. Therefore, when translating sections of texts containing polysemantic words, students often encountered difficulties.

Below are translation correspondences between several pairs of English and Russian phrases on “school” topics.

1) The bat looks like a mouse with wings – Летучая мышь похожа на мышь с крыльями.

2) Children like to play in the sand on the beach – Дети любят играть в песке на берегу моря.

3) A drop of rain fell on my hand – Капля дождя упала мне на руку.

4) Dry wood burns easily – Сухие дрова хорошо горят.

5) He pretended not to hear me – Он сделал вид, что не слышит меня.

The English phrases here are not idiomatic expressions. Nevertheless, their translation into Russian can only with some stretch be considered a simple word-for-word translation, since almost all the words in them are polysemantic. Therefore, only the achievements of computational linguistics can help students here.


