Language and Speech Technology News, Feb '02
********************************************

1. Lecture series on language and speech technology (TST) applications in Leuven
2. Native speakers of Dutch wanted in Edinburgh
3. Call for ELRA validation centers for written language data

More news on language and speech technology? Subscribe to the free
monthly e-newsletter of Euromap Language Technologies at
http://www.hltcentral.org/lists/subscribe.php.


========================================
You are receiving this message because
your details are included in the
database of the Platform for Dutch in
Language and Speech Technology (the
TST-Platform, see also
http://www.taalunieversum.org/tst).
========================================

--------------------------------------------------------------------

1. LECTURE SERIES ON TST APPLICATIONS IN LEUVEN
***********************************************

As part of the Leuven Speech and Language Technology programme, a series of
lectures on Language Engineering Applications is taking place. Below you will
find the overview of speakers, topics and abstracts.

More information: Frank Van Eynde (frank@ccl.kuleuven.ac.be).


                 LANGUAGE ENGINEERING APPLICATIONS
 G. Adriaens, D. Van Compernolle, F. Van Eynde & A. Van Wieringen
                       February - March,  2002

The course consists of a series of case studies describing a wide variety of 
applications in the domain of language engineering. In each of the lectures we 
look at what generic 'off-the-shelf' technology provides, but also at what is 
required to turn the core components into useful technology. There are twelve 
sessions of two hours.

Mo. 11 Feb  16.00-18.00   Corpus Construction and Annotation  
                          Frank Van Eynde
Tu. 12 Feb  15.30-17.30   Computational Lexicography
                          Frank Van Eynde
Mo. 18 Feb  16.00-18.00   Automatic Natural Language Processing 
                          by Ensembles of Machine Learning Agents 
                          The CGN Case
                          Antal van den Bosch (KUB, Tilburg)
Tu. 19 Feb  15.30-17.30   Dictation for Vertical Markets 
                          Dirk Van Compernolle
Mo. 25 Feb  16.00-18.00   From a Speech Recognition Research Engine 
                          to Reusable Product Code 
                          Jan Verhasselt (Scansoft, Ieper) 
Tu. 26 Feb  15.30-17.30   The Rise and Fall of L&H
                          Dirk Van Compernolle
Mo. 04 Mar  16.00-18.00   Aids for the Handicapped - I
                          Astrid Van Wieringen
Tu. 05 Mar  15.30-17.30   Aids for the Handicapped - II
                          Astrid Van Wieringen
Mo. 11 Mar  16.00-18.00   Druid
                          Data Retrieval Using Intelligent Disclosure
                          Arjan van Hessen (Univ. Twente)
Tu. 12 Mar  15.30-17.30   Proofing Tools
                          Geert Adriaens
Mo. 18 Mar  16.00-18.00   Multilingual Document Generation 
                          and Machine Translation
                          Geert Adriaens
Tu. 19 Mar  15.30-17.30   Multilingual Document Retrieval
                          and Summarization
                          Geert Adriaens

Monday: Auditorium De Oude Molen
        Kasteelpark Arenberg 50, Leuven (Heverlee), MOLE 00.00

Tuesday: Metaalkunde en Toegepaste Materiaalkunde
         Kasteelpark Arenberg 44, Leuven (Heverlee), MTM 00.39

01. CORPUS CONSTRUCTION and ANNOTATION
In order to enhance the performance of language engineering applications it is
of vital importance to have access to large-scale corpora, both of written and
spoken language. The seminar discusses the main functions and characteristics
of corpora and presents the standards and guidelines which are currently used
for the construction and annotation of corpora. It also provides a survey of the
most important English corpora and pays special attention to the compilation
and the annotation of the Spoken Dutch Corpus (Corpus Gesproken
Nederlands). This 10-million-word corpus is currently being constructed by a
consortium of Dutch and Flemish research institutes, comprising among others
the Center for Computational Linguistics (CCL) and the Speech group (ESAT-PSI)
of the K.U.Leuven; see also lecture 03.

02. COMPUTATIONAL LEXICOGRAPHY
The construction of an appropriate lexicon is invariably one of the most
labour-intensive and expensive parts of building NLP applications. For this
reason, techniques have been developed for the acquisition, structuring,
maintenance and reuse of lexical resources. Some of the more important
techniques will be presented in the lecture. The presentation will be based on
F. Van Eynde & D. Gibbon (eds.), Lexicon Development for Speech and Language
Processing, Kluwer, 2000, xii + 298 pages.

03. AUTOMATIC NATURAL LANGUAGE PROCESSING by ENSEMBLES of 
MACHINE LEARNING AGENTS 
In the past two decades, researchers in natural language processing have 
gradually embraced probabilistic and machine learning methods for the 
automatic construction of NLP systems. In many areas they have alleviated the 
knowledge acquisition bottleneck that plagued NLP before. On the other hand, 
due to their dependence on data, they have introduced a data acquisition 
bottleneck. In the context of Dutch NLP research this has been widely 
acknowledged; the Spoken Dutch Corpus (CGN) project is an example of an 
investment to accommodate the needs. Within the project itself, machine learning
systems are already employed: they are trained on the growing amount of 
annotated material to assist the human annotators in processing more new 
material. In this talk I will highlight these particular systems, consisting of an 
ensemble of different machine-learned POS-taggers, and a machine-learned 
lemmatizer. Based on that, I will discuss the top issues in machine learning of 
NLP tasks currently on the international research agenda: representation, 
modularity, and scaling to more data.
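
As a rough illustration of the ensemble idea mentioned in this abstract, the
sketch below combines the output of several taggers by a per-token majority
vote. The voting scheme, the function name combine_by_majority and the toy tags
are assumptions made for illustration only; the abstract does not say how the
CGN taggers are actually combined.

```python
from collections import Counter


def combine_by_majority(tag_sequences):
    """Combine per-token tags from several taggers by majority vote.

    tag_sequences: list of tag lists, one per tagger, all of equal length.
    Ties are resolved in favour of the tagger listed first.
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for votes in zip(*tag_sequences):
        counts = Counter(votes)
        best = max(counts.values())
        # Pick the first tag (in tagger order) that reaches the top count.
        combined.append(next(tag for tag in votes if counts[tag] == best))
    return combined


# Hypothetical output of three taggers for the tokens "de kat slaapt".
tagger_a = ["DET", "N", "V"]
tagger_b = ["DET", "N", "N"]
tagger_c = ["DET", "ADJ", "V"]
print(combine_by_majority([tagger_a, tagger_b, tagger_c]))
```

Majority voting is only the simplest way to combine taggers; it is used here
because it needs no training data beyond the individual tagger outputs.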

04. DICTATION for VERTICAL MARKETS  
Many successful dictation applications are built for so-called 'vertical'
markets, i.e. the dictation software is optimized for a particular application
domain such as medical (radiology, pathology), legal, or public services (police
reports, etc.). Adapting the general technology to such a particular domain
requires significant modifications to a number of linguistic components: the
vocabulary and the language model in particular. Furthermore, the user interface
may need to be adapted to allow for integration with the background information
handling system. Adapting a generic technology for a specific application
requires a continuous weighing of commercial potential against the associated
development costs.

05. From A SPEECH RECOGNITION RESEARCH ENGINE to REUSABLE PRODUCT 
CODE
Turning a speech recognition engine into a generic recognizer for commercial
use requires effort at many levels. "Speech Recognition" is not an end-user
product, but an enabling technology. Therefore an application programming
interface (API) layer must be provided that allows customers to integrate
speech input into their application. Also, the ideal that the speech recognizer
itself is independent of the application has not been reached yet. Therefore a
number of tools must be provided that let the product integrator (sometimes in
cooperation with the speech recognition vendor) optimize the recognizer in light
of the application. These tools include a grammar compiler, grammar editing
tools, engine tuning tools, lexicon editing/tuning, grapheme-to-phoneme
extension of the phonetic dictionary, support for diverse character encodings,
user words, and a spelling engine.

06. THE RISE AND FALL OF L&H
The fast collapse of L&H, ending in bankruptcy in late 2001, has given rise to
plenty of negative press about the speech and language industry in general. The
incredibly fast rise followed by an even faster decline makes it very hard, even
for insiders, to judge the true potential of the speech and language technology
markets. The story of L&H was heavily influenced by two outside factors, for
which it can hardly be blamed. First of all, PC technology caught up with the
algorithms that had been ripening in research labs. That the fast introduction of
new technology and progress of the late nineties was mainly due to Moore's law
and would soon run out of steam for lack of fundamental progress could not be
grasped by the industrial community. The second pillar of overly optimistic
expectations was the technology bubble in general, which promised
indefinite and unlimited growth for all new technologies. In this seminar we will
analyze the history of a few product lines at L&H and confront the initial
high-flying expectations with the ultimate reality. We will discover in a number
of instances that there was good reason for optimism about the future and that
the progress that has been made is significant indeed. Nevertheless, today's
business reality lags far behind these expectations. At the same time we must
come to the conclusion that the speech and language technology market is far
from dead. It is much smaller than some dreamed of, but clearly has a future.

07-08. AIDS for the HANDICAPPED   
The two lectures on this topic provide a practical guide to devices and services
that will improve the communication abilities of hearing impaired, visually
impaired and speech/mobility impaired persons. For instance, hearing impaired
persons can benefit from amplified telephones, assistive listening devices, and
visual signalling and alerting devices; visually impaired persons can benefit
from speech automatically translated to braille. The first lecture concerns the
use of speech synthesis and speech recognition applications in daily life, as
well as other aids for hearing impaired, visually impaired, and speech/mobility
impaired persons (e.g. tactile sensory aids, word prediction programs). The
second lecture focuses on the hearing impaired only. After a brief introduction
to the auditory system and hearing loss, the possibilities of hearing aids and
cochlear implants are discussed in detail, together with different ways of
assessing speech perception performance.

09. DRUID - DATA RETRIEVAL USING INTELLIGENT DISCLOSURE
Retrieval of information from large text corpora has become a mature science
in recent years. Nowadays one can search for information with "natural"
question phrases, which will result in lists of documents that deal with the
relevant topic, even if they do not literally contain the words of the query.
What remains a problem, though, is that much of the relevant information is not
available in a written format, but only as sound (speech) or image. In order to
make this information retrievable, audio and/or video fragments have to be
transformed into some kind of textual representation. This transformation is the
central topic of the DRUID project. The lecture will focus on the speech
recognition and language modelling aspects of the work.

10. PROOFING TOOLS
The most widespread language engineering technology is without any doubt
the spelling checker found in word processors. In just a few years spelling
checkers have evolved from technologically advanced and high-priced technology
to an almost freely available commodity. More at the forefront of technology
today are grammar and style checkers. These may not only be used to enforce
grammatically correct language, but may also help to create documents that are
deemed of acceptable quality for publishing or for later use in other language
engineering applications (such as machine translation).

11. MULTILINGUAL DOCUMENT GENERATION and MACHINE TRANSLATION
With the world becoming a global village, some expect (or hope) that everyone
will speak English within one or two generations. This is unlikely to be the
case. However, the increase in communication will increase the need for
automatic translation. It is unlikely that the EC will keep manually translating
all its reports into all languages as the EC members become more numerous. It is
quite certain that once automatic translation of 'acceptable quality' is
available, this will be the preferred access method to documents generated in
other languages. Generic machine translation can be enhanced by optimizing the
underlying grammars for certain domains and most of all by limiting the freedom
of the document creator. The latter is only possible for manuals in large
corporations or for documents in official organizations, but e.g. not for
everything that gets published on the web. Staying within a somewhat constrained
style makes it likely that the analysis in the first phase of the translation
process is correct, and the translation therefore has a higher chance of
success.

12. MULTILINGUAL DOCUMENT RETRIEVAL and SUMMARIZATION   
Translation of a document is only one part of the solution in a global
information society. When searching for information, one first needs to know
which documents are worth translating and reading. Thus multilingual search and
information summarization are typically the first steps in identifying the
documents of interest. Instead of translation of full documents, the emphasis is
on translating queries, document indices etc.
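
The idea of translating the query rather than the documents can be sketched as
follows. This is a minimal illustration under stated assumptions: the
dictionary NL_TO_EN, the toy collection DOCUMENTS and both function names are
hypothetical, and real systems use far richer translation resources and
ranking models than simple term overlap.

```python
# Hypothetical Dutch-to-English term dictionary (illustrative entries only).
NL_TO_EN = {
    "spraak": ["speech"],
    "herkenning": ["recognition"],
    "taal": ["language"],
}

# Toy English document collection: document id -> text.
DOCUMENTS = {
    "doc1": "speech recognition for dictation",
    "doc2": "machine translation of manuals",
    "doc3": "language models for speech",
}


def translate_query(query, dictionary):
    """Replace each query term by its translations; keep unknown terms as-is."""
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def retrieve(query_terms, documents):
    """Rank documents by the number of distinct query terms they contain."""
    scores = {}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = sum(1 for term in set(query_terms) if term in words)
        if score:
            scores[doc_id] = score
    # Highest score first; ties are broken alphabetically by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


terms = translate_query("spraak herkenning", NL_TO_EN)
print(terms)
print(retrieve(terms, DOCUMENTS))
```

Only the short query passes through translation here, so the documents
themselves are never translated before the user has decided which ones are
worth reading.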

--------------------------------------------------------------------

2. NATIVE SPEAKERS OF DUTCH WANTED IN EDINBURGH
***********************************************

Rhetorical Systems is looking for native speakers of Danish, Dutch, Italian, 
Japanese, Mandarin, Norwegian and Swedish to help them prepare foreign 
language texts for a text-to-speech system. The work will be full-time (30 
hours/week or more), based in Edinburgh, Scotland, and is expected to start in 
March 2002 or sooner and last for approximately 3 months. The work is ideal 
for students looking for work experience.

Native ability in one of the above languages and a good working knowledge of
phonetics and phonology are essential, as is experience of using computers in a
Unix/Linux environment. In addition, the ability to use the internet as a
research tool is highly desirable.

Rhetorical Systems is a speech technology company based in Edinburgh, 
Scotland. We have close links with Edinburgh University, which has world-
famous departments for Linguistics, Cognitive Science and Artificial Intelligence. 
The work programme of 30 hours per week will leave you with plenty of time to 
attend seminars and workshops in this stimulating intellectual environment.

If you are interested, mail laurence.molloy@rhetorical.com with a description of 
your background and experience.

Laurence Molloy
Rhetorical Systems Ltd
4 Crichton's Close, Edinburgh EH8 8DT
Email: laurence.molloy@rhetorical.com

Check us out at www.rhetorical.com and try out our Demo.

--------------------------------------------------------------------

3. CALL FOR ELRA VALIDATION CENTERS FOR WRITTEN LANGUAGE DATA
*************************************************************

Our apologies if you receive multiple copies

CALL FOR CREATING A NETWORK OF TECHNICAL CENTERS FOR WRITTEN 
LANGUAGE RESOURCES VALIDATION

1. Preamble

Describing, assuring and improving the quality of language resources are 
important tasks. The assurance of such quality is an important factor in ELRA's 
success. In the start-up phase of ELRA it was foreseen that a Network of
Technical Centers should be established to handle quality control. To date a 
technical center for the validation of spoken language resources has been 
established. ELRA now intends to initiate the establishment of a network of 
technical centers for the validation of written language resources, the Validation 
Centers for Written Language Resources or VC_WLR. Written resources include
lexicons as well as text corpora, possibly enriched with all kinds of annotations 
(POS-tags, syntactic structures, etc.). The procedure to establish the VC_WLR is 
identical to the one adopted in establishing the technical centers for spoken 
language resources, viz. they are to be established via an open call. Those 
European institutions willing to act as a VC_WLR for ELRA should send an offer 
to ELRA. The contents of this offer are described below. In particular, the offer 
must contain a proposal on how to address the problem of the detailed and 
thorough knowledge of a wide variety of languages required by the validation of 
multilingual resources.
ELRA's Board will decide which institutions will be selected. The selection of 
each candidate institution will be based on its ability to fulfill the tasks described 
in Section 2. The organizational and financial aspects are described in Section 3.

2. Work packages (WP) of the VC_WLR

2.1 Extending the Methodology for Describing the Quality and Content of 
Existing WLR 
In the catalogue of ELRA many WLR are offered whose quality and content are
not yet described in a satisfactory way. Some projects have resulted in 
linguistic resources distributed by ELRA that are comparable across languages 
in accordance with a commonly agreed content and format specification (e.g. 
PAROLE). However, almost no written data distributed by ELRA have been 
subject to validation by an external party and in accordance with a commonly 
agreed validation scheme (except for a limited number of PAROLE lexicons, and 
recently in the context of the ENABLER project). Though some research into the 
validation of linguistic resources has taken place and recommendations and 
guidelines have been formulated (e.g. Nancy Underwood et al., June 1998; Lou 
Burnard for text corpora), these have to be reviewed and where necessary 
adapted and extended to develop a concrete and workable methodology for the 
ELRA validation of written linguistic resources. The knowledge and expertise 
gained in the successful approach to validation taken in the SpeechDat family of 
spoken resources and by the existing ELRA validation center for spoken 
resources could be taken into consideration here, and its methods and 
approaches translated into an approach adapted for written language resources 
while maintaining the key elements that determined the success of the approach
to speech.
The first task of the VC_WLR is to establish and/or extend the methodology for 
quality and content description so far developed. The related document should 
focus on the quality and content of the WLR offered in the ELRA catalogue. A 
standard form should be developed for describing the content and quality of a 
WLR, starting from the form currently in use and taking into account the work 
carried out within TEI, OLAC, etc. The WLR in the ELRA catalogue will have to be
described according to this standard. This description will be used as a basis for 
providing any (potential) user with a quick overview in the ELRA catalogue 
relating to the quality and content of each WLR offered.
Output of WP2.1: 
- Document describing methodology concerning quality and content
- Content and quality description of all ELRA WLR

2.2 Improving the Quality of Existing WLR 
Existing WLR may have errors that could be removed with reasonable effort.
The task of the VC_WLR is to establish a procedure to remove these errors.
In particular, a procedure has to be established to handle the errors reported
by users of WLR (bug reporting procedure). Further, the existing WLR can be
improved by better documentation, by reformatting according to established
standards and by content changes. A similar procedure for spoken language
resources (SLR) has been proposed and is currently being implemented and
experimented with; hence it is sensible to investigate to what extent the
procedure proposed for SLR can be adopted for the improvement of WLR and
what modifications and/or extensions are necessary or desirable. The quality of
the existing WLR should be gradually improved in accordance with a priority
scheme that has to be worked out in close cooperation with ELRA's validation
committee. The scheme has to be approved by the ELRA board.
Output of WP 2.2:
- Report describing the procedure to be used to improve existing WLR
- Improved existing WLR, according to a priority scheme

2.3 Quality Standards for WLR
The VC_WLR have to play a leading role in establishing quality standards for
WLR. For this task the VC_WLR have to cooperate with organizations involved in
the production of WLR such as the consortia of the PAROLE and SIMPLE 
projects, and with ELRA's distribution agency (currently ELDA). Additionally, the 
extent to which existing recommendations, guidelines and proposed standards 
from groups such as the EAGLES and ISLE projects can be incorporated should 
be considered throughout.
Output of WP 2.3:
- Report describing the procedure for building up relationships with significant 
WLR producers and standards groups 
- Following on from the report, the establishment of those relationships

2.4 Validation of New WLR
Owners of WLR regularly offer their WLR to ELRA for distribution. ELRA has the 
distribution carried out by its distribution agency (currently ELDA). Each time a 
WLR is offered for distribution, the task of the VC_WLR is to establish in 
cooperation with the owner of the WLR a manual containing:
- The specification of the content of the WLR,
- The validation criteria for checking the quality of the WLR,
- The procedure to validate the WLR.
Based on this manual the VC_WLR have to validate any new WLR offered for 
distribution.
Output of WP 2.4:
- Report on the validation procedure as specified in a specific contract between 
ELDA and the center(s)

2.5 Reporting
Twice a year the VC_WLR must report work undertaken to date to the board of 
ELRA via the head of the validation committee.
Output of WP 2.5:
- Status reports 

3. Organizational and Financial Issues

3.1 Relation between ELRA and VC_WLR
For tasks 2.1, 2.2, 2.3 and 2.5 as described above, the relation between
ELRA and the institution(s) appointed as VC_WLR will be regulated by a
contract between ELRA and those institutions. The contract has to be renewed
after every fiscal year of ELRA by the Board of ELRA. Three months before the
end of each fiscal year of ELRA the Board of ELRA will decide on the financial
support to be given to the VC_WLR for the next fiscal year to perform tasks
2.1, 2.2, 2.3 and 2.5. Annually, a letter of intent will describe a budget for
the year for the VC_WLR. The initial amount made available will be
approximately 15K EUR.
The ELRA validation committee will act as a steering committee for all activities 
related to validation of written resources. All actions proposed by the validation 
committee and agreed upon between the validation committee and the 
appointed VC_WLR will have to be approved by the ELRA Board.

3.2 Relation between ELDA and the VC_WLR
Separate contracts will be made with ELDA concerning task 2.4 on a case-by-
case basis. 

4. Format and Procedure for Offer

To apply to be a VC_WLR, send your offer by e-mail (as ASCII or RTF files,
approx. 2000 words) to the CEO of ELRA (Khalid Choukri, choukri@elda.fr) and
to the head of the ELRA validation committee (Harald Hoege,
harald.hoege@mchp.siemens.de). The e-mail should contain:
1. Name of the proposing institute.
2. The name of the person at the institute who will be the head of the VC_WLR.
3. A statement outlining the suitability of the institute to act as a VC_WLR.
4. A proposal on how the institute plans to provide for the required detailed and 
thorough knowledge of a wide variety of languages. 
5. A list of personnel who will work on the tasks to be undertaken by the 
VC_WLR.
6. A possible start date.
7. A sketch of the work for the work packages described that can be carried out
within the fiscal year 2002 (1.1.02 - 31.12.02) for a budget of at most
15K EUR. For each work package a rough estimate of the costs should be given.

Proposals are due by Friday March 1, 2002.

ELRA/ELDA
55-57, rue Brillat Savarin
75013 Paris
France
Tel.: +33 1 43 13 33 33
Fax: +33 1 43 13 33 30
Email: choukri@elda.fr

--------------------------------------------------------------------
© Nederlandse Taalunie, 2000-2008, all rights reserved