Language and Speech Technology News
***********************************
1. Lecture series on language and speech technology applications in Leuven
2. Native speakers of Dutch wanted in Edinburgh
3. Call for ELRA validation centers for written language data
More news about language and speech technology? Subscribe to the free
monthly e-newsletter of Euromap Language Technologies at
http://www.hltcentral.org/lists/subscribe.php.
========================================
You are receiving this message because
your details are included in the
database of the Platform for Dutch in
Language and Speech Technology (the
TST-Platform, see also
http://www.taalunieversum.org/tst).
========================================
--------------------------------------------------------------------
1. LECTURE SERIES ON LANGUAGE AND SPEECH TECHNOLOGY APPLICATIONS IN LEUVEN
**************************************************************************
As part of the Leuven Speech and Language Technology programme, a series of
lectures on Language Engineering Applications is taking place. Below you will
find the overview of speakers, topics and abstracts.
More information: Frank Van Eynde (frank@ccl.kuleuven.ac.be).
LANGUAGE ENGINEERING APPLICATIONS
G. Adriaens, D. Van Compernolle, F. Van Eynde & A. Van Wieringen
February - March, 2002
The course consists of a series of case studies describing a wide variety of
applications in the domain of language engineering. In each of the lectures we
look at what generic 'off-the-shelf' technology provides, but also at what is
required to turn the core components into useful technology. There are twelve
sessions of two hours.
Mo. 11 Feb  16.00-18.00  Corpus Construction and Annotation
                         Frank Van Eynde

Tu. 12 Feb  15.30-17.30  Computational Lexicography
                         Frank Van Eynde

Mo. 18 Feb  16.00-18.00  Automatic Natural Language Processing
                         by Ensembles of Machine Learning Agents
                         The CGN Case
                         Antal van den Bosch (KUB, Tilburg)

Tu. 19 Feb  15.30-17.30  Dictation for Vertical Markets
                         Dirk Van Compernolle

Mo. 25 Feb  16.00-18.00  From a Speech Recognition Research Engine
                         to Reusable Product Code
                         Jan Verhasselt (Scansoft, Ieper)

Tu. 26 Feb  15.30-17.30  The Rise and Fall of L&H
                         Dirk Van Compernolle

Mo. 04 Mar  16.00-18.00  Aids for the Handicapped - I
                         Astrid Van Wieringen

Tu. 05 Mar  15.30-17.30  Aids for the Handicapped - II
                         Astrid Van Wieringen

Mo. 11 Mar  16.00-18.00  Druid
                         Data Retrieval Using Intelligent Disclosure
                         Arjan van Hessen (Univ. Twente)

Tu. 12 Mar  15.30-17.30  Proofing Tools
                         Geert Adriaens

Mo. 18 Mar  16.00-18.00  Multilingual Document Generation
                         and Machine Translation
                         Geert Adriaens

Tu. 19 Mar  15.30-17.30  Multilingual Document Retrieval
                         and Summarization
                         Geert Adriaens

Monday: Auditorium De Oude Molen
Kasteelpark Arenberg 50, Leuven (Heverlee), MOLE 00.00
Tuesday: Metaalkunde en Toegepaste Materiaalkunde
Kasteelpark Arenberg 44, Leuven (Heverlee), MTM 00.39
01. CORPUS CONSTRUCTION and ANNOTATION
In order to enhance the performance of language engineering applications it is
of vital importance to have access to large-scale corpora, both of written and
spoken language. The seminar discusses the main functions and characteristics
of corpora and presents the standards and guidelines which are currently used
for the construction and annotation of corpora. It also provides a survey of the
most important English corpora and pays special attention to the compilation
and the annotation of the Spoken Dutch Corpus (Corpus Gesproken
Nederlands). This 10-million-word corpus is currently being constructed by a
consortium of Dutch and Flemish research institutes, including the Center
for Computational Linguistics (CCL) and the Speech group (ESAT-PSI) of the
K.U.Leuven; see also lecture 03.
02. COMPUTATIONAL LEXICOGRAPHY
The construction of an appropriate lexicon is invariably one of the most labour-
intensive and expensive parts in the construction of NLP applications. For this
reason, techniques have been developed for the acquisition, structuring,
maintenance and reusability of lexical resources. Some of the more important
techniques will be presented in the lecture. The presentation will be based on F.
Van Eynde & D. Gibbon (eds.), Lexicon Development for Speech and Language
Processing, Kluwer, 2000, xii + 298 pages.
03. AUTOMATIC NATURAL LANGUAGE PROCESSING by ENSEMBLES of
MACHINE LEARNING AGENTS
In the past two decades, researchers in natural language processing have
gradually embraced probabilistic and machine learning methods for the
automatic construction of NLP systems. In many areas they have alleviated the
knowledge acquisition bottleneck that plagued NLP before. On the other hand,
due to their dependence on data, they have introduced a data acquisition
bottleneck. In the context of Dutch NLP research this has been widely
acknowledged; the Spoken Dutch Corpus (CGN) project is an example of an
investment to accommodate these needs. Within the project itself, machine learning
systems are already employed: they are trained on the growing amount of
annotated material to assist the human annotators in processing more new
material. In this talk I will highlight these particular systems, consisting of an
ensemble of different machine-learned POS-taggers, and a machine-learned
lemmatizer. Based on that, I will discuss the top issues in machine learning of
NLP tasks currently on the international research agenda: representation,
modularity, and scaling to more data.
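As a rough illustration of the ensemble idea mentioned in this abstract, the
sketch below combines the outputs of several POS-taggers by simple majority
voting. The tagger outputs, the tag names and the majority_vote helper are
hypothetical; the abstract does not say how the CGN taggers are actually
combined.

```python
from collections import Counter


def majority_vote(tag_sequences):
    """Combine per-token tags from several taggers by majority vote.

    tag_sequences: list of tag lists, one per tagger, all of equal length.
    Ties are broken in favour of the earliest tagger in the list.
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same number of tokens")
    combined = []
    for token_tags in zip(*tag_sequences):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag (in tagger order) that reaches the top count.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


# Hypothetical outputs of three taggers for the tokens "de", "kat", "slaapt".
tagger_a = ["LID", "N", "WW"]
tagger_b = ["LID", "N", "N"]
tagger_c = ["VNW", "N", "WW"]
print(majority_vote([tagger_a, tagger_b, tagger_c]))
```

Majority voting is only the simplest way to combine taggers; weighted voting
or a second-level learner trained on the taggers' outputs are common
alternatives.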
04. DICTATION for VERTICAL MARKETS
Many successful dictation applications are built for so-called 'vertical' markets;
i.e. the dictation software is optimized for a particular application domain such
as medicine (radiology, pathology), law, or public services (police reports, etc.).
Adapting the general technology to such a particular domain requires significant
modifications to a number of linguistic components, the vocabulary and language
model in particular. Furthermore, the user interface may need to be adapted to
allow for integration with the background information handling system. Adapting
a generic technology for a specific application requires continuously weighing
commercial potential against the associated development costs.
05. From a SPEECH RECOGNITION RESEARCH ENGINE to REUSABLE PRODUCT
CODE
Turning a speech recognition engine into a generic recognizer for commercial
use requires effort at many levels. "Speech Recognition" is not an end-user
product, but an enabling technology. Therefore an application programming
interface (API) layer must be provided, that allows customers to integrate
speech input into their application. Also, the ideal that the speech recognizer
itself is independent of the application has not been reached yet. Therefore a
number of tools must be provided that let the product integrator (sometimes in
cooperation with the speech recognition vendor) optimize the recognizer in light
of the application. These tools include a grammar compiler, grammar editing
tools, engine tuning tools, lexicon editing/tuning, grapheme-to-phoneme
extension of the phonetic dictionary, support for diverse character encodings,
user words, and a spelling engine.
06. The RISE and FALL of L&H
The fast collapse of L&H, ending in bankruptcy in late 2001, has given rise to
plenty of negative press about the speech and language industry in general. The
incredibly fast rise followed by an even faster decline makes it very hard, even
for insiders, to judge the true potential of the speech and language technology
markets. The story of L&H was heavily influenced by two outside factors, for
which it can hardly be blamed. First of all, PC technology caught up with the
algorithms that had been ripening in research labs. The industrial community
failed to grasp that the fast introduction of new technology and the progress of
the late nineties were mainly due to Moore's law and would soon run out of
steam for lack of fundamental progress. The second pillar of overly optimistic
expectations was the technology bubble in general, which promised
indefinite and unlimited growth for all new technologies. In this seminar we will
analyze the history of a few product lines at L&H and confront the initial
high-flying expectations with the ultimate reality. We will discover in a number of
instances that there was good reason for optimism about the future and that the
progress that has been made is significant indeed. Nevertheless, today's
business reality lags far behind these expectations. At the same time we must
come to the conclusion that the speech and language technology market is far
from dead. It is much smaller than some dreamed of, but clearly has a future.
07-08. AIDS for the HANDICAPPED
The two lectures on this topic provide a practical guide to devices and services
that improve the communication abilities of hearing-impaired, visually impaired
and speech/mobility-impaired persons. For instance, hearing-impaired persons
can benefit from amplified telephones, assistive listening devices, and visual
signalling and alerting devices, and visually impaired persons from speech that
is automatically translated to braille. The first lecture concerns the use of speech
synthesis and speech recognition applications in daily life, as well as other aids
for hearing-impaired, visually impaired, and speech/mobility-impaired
persons (e.g. tactile sensory aids, word prediction programs). The second
lecture focuses on the hearing-impaired only. After a brief introduction to the
auditory system and hearing loss, the possibilities of hearing aids and cochlear
implants are discussed in detail, together with different ways of assessing
speech perception performance.
09. DRUID - DATA RETRIEVAL USING INTELLIGENT DISCLOSURE
Retrieval of information from large text corpora has become a mature science
in recent years. Nowadays one can search for information with "natural"
question phrases, which will result in lists of documents that deal with the
relevant topic, even if they do not literally contain the words of the query. What
remains a problem, though, is that much of the relevant information is not
available in a written format, but only as sound (speech) or image. In order to
make this information retrievable, audio and/or video fragments have to be
transformed into some kind of textual representation. This transformation is the
central topic of the DRUID project. The lecture will focus on the speech
recognition and language modelling aspects of the work.
10. PROOFING TOOLS
The most widespread language engineering technology is without any doubt
the spelling checker built into word processors. In just a few years these have
evolved from technologically advanced and high-priced technology to an almost
freely available commodity. More at the forefront of technology today are
grammar and style checkers. These may not only be used to enforce
grammatically correct language, but may help to create documents that are
deemed of acceptable quality for publishing or for later use in other language
engineering applications (such as machine translation).
11. MULTILINGUAL DOCUMENT GENERATION and MACHINE TRANSLATION
With the world becoming a global village, some expect (or hope) that everyone
will speak English within one or two generations. This is unlikely to be the case.
However, the increase in communication will increase the need for automatic
translation. It is unlikely that the EC will keep manually translating all its reports
into all languages as the EC members become more numerous. It is quite certain
that once 'acceptable quality' automatic translation is available, this will be the
preferred access method to documents generated in other languages. Generic
machine translation can be enhanced by optimizing the underlying grammars
for certain domains and most of all by limiting the freedom of the document
creator. The latter is only possible for manuals in large corporations or for
documents in official organizations, but not, for example, for everything that gets
published on the web. Staying within a somewhat constrained style makes it more
likely that the analysis in the first phase of the translation process is correct,
and therefore that the translation as a whole succeeds.
12. MULTILINGUAL DOCUMENT RETRIEVAL and SUMMARIZATION
Translation of a document is only one part of the solution in a global information
society. When searching for information, one first needs to know which
documents are worth translating and reading. Multilingual search and
information summarization are therefore typically the first steps in identifying
the documents of interest. Instead of translating full documents, the emphasis is
on translating queries, document indices, etc.
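To make the idea of translating queries rather than documents concrete, here
is a minimal sketch, assuming a small bilingual lexicon and an inverted index
over English documents. The lexicon entries, document identifiers and the
helper names translate_query and search are all hypothetical and are not
taken from the lecture.

```python
def translate_query(query, bilingual_lexicon):
    """Translate each query term with a bilingual lexicon.

    bilingual_lexicon maps a source-language term to a list of translations.
    Terms missing from the lexicon are kept unchanged.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(bilingual_lexicon.get(term, [term]))
    return translated


def search(index, terms):
    """Rank documents by the number of distinct query terms they contain."""
    scores = {}
    for term in set(terms):
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties are ordered by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical Dutch-to-English lexicon and English inverted index.
lexicon = {"spraak": ["speech"], "herkenning": ["recognition"]}
index = {
    "speech": {"doc1", "doc2"},
    "recognition": {"doc1"},
    "translation": {"doc3"},
}
terms = translate_query("spraak herkenning", lexicon)
print(search(index, terms))
```

Only the short query passes through the lexicon, so the documents themselves
need translating only once they have been judged relevant.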
--------------------------------------------------------------------
2. NATIVE SPEAKERS OF DUTCH WANTED IN EDINBURGH
***********************************************
Rhetorical Systems is looking for native speakers of Danish, Dutch, Italian,
Japanese, Mandarin, Norwegian and Swedish to help them prepare foreign
language texts for a text-to-speech system. The work will be full-time (30
hours/week or more), based in Edinburgh, Scotland, and is expected to start in
March 2002 or sooner and last for approximately 3 months. The work is ideal
for students looking for work experience.
Native ability in one of the above languages and a good working knowledge of
phonetics and phonology are essential, as is experience of using computers in a
Unix/Linux environment. In addition, the ability to use the internet as a
research tool is highly desirable.
Rhetorical Systems is a speech technology company based in Edinburgh,
Scotland. We have close links with Edinburgh University, which has world-
famous departments for Linguistics, Cognitive Science and Artificial Intelligence.
The work programme of 30 hours per week will leave you with plenty of time to
attend seminars and workshops in this stimulating intellectual environment.
If you are interested, mail laurence.molloy@rhetorical.com with a description of
your background and experience.
Laurence Molloy
Rhetorical Systems Ltd
4 Crichton's Close, Edinburgh EH8 8DT
Email: laurence.molloy@rhetorical.com
Check us out at www.rhetorical.com and try out our Demo.
--------------------------------------------------------------------
3. CALL FOR ELRA VALIDATION CENTERS FOR WRITTEN LANGUAGE DATA
*************************************************************
Our apologies if you receive multiple copies
CALL FOR CREATING A NETWORK OF TECHNICAL CENTERS FOR WRITTEN
LANGUAGE RESOURCES VALIDATION
1. Preamble
Describing, assuring and improving the quality of language resources are
important tasks. The assurance of such quality is an important factor in ELRA's
success. In the start-up phase of ELRA it was foreseen that a Network of
Technical Centers should be established to handle quality control. To date a
technical center for the validation of spoken language resources has been
established. ELRA now intends to initiate the establishment of a network of
technical centers for the validation of written language resources, the Validation
Centers for Written Language Resources or VC_WLR. Written resources include
lexicons as well as text corpora, possibly enriched with all kinds of annotations
(POS-tags, syntactic structures, etc.). The procedure to establish the VC_WLR is
identical to the one adopted in establishing the technical centers for spoken
language resources, viz. they are to be established via an open call. Those
European institutions willing to act as a VC_WLR for ELRA should send an offer
to ELRA. The contents of this offer are described below. In particular, the offer
must contain a proposal on how to address the problem of the detailed and
thorough knowledge of a wide variety of languages required by the validation of
multilingual resources.
ELRA's Board will decide which institutions will be selected. The selection of
each candidate institution will be based on its ability to fulfill the tasks described
in Section 2. The organizational and financial aspects are described in Section 3.
2. Work packages (WP) of the VC_WLR
2.1 Extending the Methodology for Describing the Quality and Content of
Existing WLR
In the catalogue of ELRA many WLR are offered whose quality and content are
not yet described in a satisfactory way. Some projects have resulted in
linguistic resources distributed by ELRA that are comparable across languages
in accordance with a commonly agreed content and format specification (e.g.
PAROLE). However, almost no written data distributed by ELRA have been
subject to validation by an external party and in accordance with a commonly
agreed validation scheme (except for a limited number of PAROLE lexicons, and
recently in the context of the ENABLER project). Though some research into the
validation of linguistic resources has taken place and recommendations and
guidelines have been formulated (e.g. Nancy Underwood et al., June 1998; Lou
Burnard for text corpora), these have to be reviewed and where necessary
adapted and extended to develop a concrete and workable methodology for the
ELRA validation of written linguistic resources. The knowledge and expertise
gained in the successful approach to validation taken in the SpeechDat family of
spoken resources and by the existing ELRA validation center for spoken
resources could be taken into consideration here, and its methods and
approaches translated into an approach adapted for written language resources
while maintaining the key elements that determined the success of the approach
to speech.
The first task of the VC_WLR is to establish and/or extend the methodology for
quality and content description so far developed. The related document should
focus on the quality and content of the WLR offered in the ELRA catalogue. A
standard form should be developed for describing the content and quality of a
WLR, starting from the form currently in use and taking into account the work
carried out within TEI, OLAC, etc. The WLR in the ELRA catalogue will have to be
described according to this standard. This description will be used as a basis for
providing any (potential) user with a quick overview in the ELRA catalogue
relating to the quality and content of each WLR offered.
Output of WP2.1:
- Document describing methodology concerning quality and content
- Content and quality description of all ELRA WLR
2.2 Improving the Quality of Existing WLR
Existing WLR may have errors that could be removed with reasonable effort.
The task of the VC_WLR is to establish a procedure to remove these errors.
In particular, a procedure has to be established that handles the errors reported
by users of WLR (bug reporting procedure). Further, the existing WLR can be
improved by better documentation, by reformatting according to established
standards and by content changes. A similar procedure for spoken language
resources has been proposed and is currently being implemented and
experimented with; hence it is sensible to investigate to what extent the
procedure proposed for SLR can be adopted for the improvement of WLR and
what modifications and/or extensions are necessary or desirable. The quality of
the existing WLR should be gradually improved in accordance with a priority
scheme that has to be worked out in close cooperation with ELRA's validation
committee. The scheme has to be approved by the ELRA board.
Output of WP 2.2:
- Report describing the procedure to be used to improve existing WLR
- Existing WLR improved according to a priority scheme
2.3 Quality Standards for WLR
The VC_WLR have to play a leading role in establishing quality standards for
WLR. For this task the VC_WLR have to cooperate with organizations involved in
the production of WLR such as the consortia of the PAROLE and SIMPLE
projects, and with ELRA's distribution agency (currently ELDA). Additionally, the
extent to which existing recommendations, guidelines and proposed standards
from groups such as the EAGLES and ISLE projects can be incorporated should
be considered throughout.
Output of WP 2.3:
- Report describing the procedure for building up relationships with significant
WLR producers and standards groups
- Following on from the report, the establishment of those relationships
2.4 Validation of New WLR
Owners of WLR regularly offer their WLR to ELRA for distribution. ELRA has the
distribution carried out by its distribution agency (currently ELDA). Each time a
WLR is offered for distribution, the task of the VC_WLR is to establish in
cooperation with the owner of the WLR a manual containing:
- The specification of the content of the WLR,
- The validation criteria for checking the quality of the WLR,
- The procedure to validate the WLR.
Based on this manual the VC_WLR have to validate any new WLR offered for
distribution.
Output of WP 2.4:
- Report on the validation procedure as specified in a specific contract between
ELDA and the center(s)
2.5 Reporting
Twice a year the VC_WLR must report work undertaken to date to the board of
ELRA via the head of the validation committee.
Output of WP 2.5:
- Status reports
3. Organizational and Financial Issues
3.1 Relation between ELRA and VC_WLR
For tasks 2.1, 2.2, 2.3 and 2.5 as described above, the relation between
ELRA and the institution(s) appointed as VC_WLR will be regulated by a
contract between ELRA and those institutions. The contract has to be renewed
by the Board of ELRA after every ELRA fiscal year. Three months before the
end of each fiscal year, the Board of ELRA will decide on the financial
support to be given to the VC_WLR for the next fiscal year to perform tasks
2.1, 2.2, 2.3 and 2.5. Annually, a letter of intent will describe a budget for the
year for the VC_WLR. The initial amount made available will be approximately 15K
EUR.
The ELRA validation committee will act as a steering committee for all activities
related to validation of written resources. All actions proposed by the validation
committee and agreed upon between the validation committee and the
appointed VC_WLR will have to be approved by the ELRA Board.
3.2 Relation between ELDA and the VC_WLR
Separate contracts will be made with ELDA concerning task 2.4 on a case-by-
case basis.
4. Format and Procedure for Offer
To apply to be a VC_WLR, send your offer by e-mail (as ASCII or RTF files,
approx. 2000 words) to the CEO of ELRA (Khalid Choukri, choukri@elda.fr) and
to the head of the ELRA validation committee (Harald Hoege,
harald.hoege@mchp.siemens.de). The e-mail should contain:
1. Name of the proposing institute.
2. The name of the person at the institute who will be the head of the VC_WLR.
3. A statement outlining the suitability of the institute to act as a VC_WLR.
4. A proposal on how the institute plans to provide for the required detailed and
thorough knowledge of a wide variety of languages.
5. A list of personnel who will work on the tasks to be undertaken by the
VC_WLR.
6. A possible start date.
7. A sketch of the work for the work packages described that can be carried out
within the fiscal year 2002 (1.1.02 - 31.12.02) for a budget of at most
15K EUR. For each work package a rough estimate of the costs should be given.
Proposals are due by Friday, March 1, 2002.
ELRA/ELDA
55-57, rue Brillat Savarin
75013 Paris
France
Tel.: +33 1 43 13 33 33
Fax: +33 1 43 13 33 30
Email: choukri@elda.fr
--------------------------------------------------------------------
© Nederlandse Taalunie, 2000-2008, all rights reserved