SERISS thesaurus evaluation: final results

Lorna Balkan

Project aims

In recent years, ELSST has benefited from two strands of funding. Development work on the thesaurus content and software has been supported by the ESRC-funded CESSDA-ELSST project, and continues under the EU-funded CESSDA Vocabulary Services Multilingual Content Management (CESSDA VOICE) project. Complementary work on assessing the translation quality of terms has been carried out as part of the Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) project, which is also funded by the EU. The SERISS work has now been completed.

The SERISS project investigated two methods for assessing the translation quality of ELSST terms. The first method was back-translation, which was performed on a subset of ELSST terms in two of the target languages: French and German. The work was very labour-intensive, but helped to identify cases where translations differed semantically and stylistically from the source terms. This was reported previously (see the First results of SERISS project).

The second quality assessment method, discussed here, involved comparing the sets of index terms assigned to the same resource in different languages. Two types of resource were chosen: whole studies and individual questions.

Whole-study indexing

For the whole-study indexing, a comparison was made of the ELSST terms used to index the same set of cross-national surveys in different languages. The indexing had been carried out previously by members of the Consortium of European Social Science Data Archives (CESSDA ERIC) or CESSDA-related archives, according to their own indexing procedures. These procedures differed widely, with some archives assigning index terms at more granular levels than others. Consequently, the results, discussed in First results of SERISS project, were difficult to compare. Moreover, whole-study indexing produces an alphabetic list of terms in which it is difficult to see which term relates to which part of a study.

Question indexing

Indexing individual questions is not currently practised by any of the CESSDA and CESSDA-related archives, but a small sample was selected and indexed as part of the SERISS project. The questions were taken from three surveys (the European Social Survey (ESS), the European Values Study (EVS) and the Survey of Health, Ageing and Retirement in Europe (SHARE)), and indexed with ELSST terms in German, Greek and Romanian. As before, the results were analysed, not only to compare how consistent the indexing was between indexers (and thereby uncover any potential problems with the terms or their translations), but also to see how well they covered the semantic content expressed in the questions.

This time, it was easier to see how the terms related to the object indexed (i.e. the question text) and to identify problems such as ambiguity (where a term’s meaning was not clear in the source and/or target language) and redundancy (where there was too great an overlap of meaning between two or more terms in the same language). However, other factors besides the properties of the terms themselves influenced the indexers’ choice of terms. In some cases, differences in how questions were worded in the different languages had an impact on the indexing terms chosen, and despite the fact that all indexers were using the same indexing instructions, each indexer interpreted them slightly differently. The experiment also revealed cases where the semantic content of the questions could not be adequately covered by ELSST terms, resulting in some new term suggestions. More details can be found in the Report on application of indexing terms in the data lifecycle.
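As an illustration of how indexing consistency between indexers might be quantified, here is a minimal sketch. It assumes that each indexer's terms have first been mapped to language-independent concept identifiers, and it uses Jaccard overlap as one possible consistency measure. The measure, the function names and the sample data are all assumptions made for illustration; the SERISS report may have used a different approach.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of concepts two indexers have in common (1.0 = identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency_by_question(indexing):
    """indexing maps question id -> {language: set of concept ids}.

    Returns question id -> mean pairwise Jaccard overlap across languages.
    Questions indexed in fewer than two languages are skipped.
    """
    scores = {}
    for question, by_language in indexing.items():
        pairs = list(combinations(sorted(by_language), 2))
        if not pairs:
            continue
        total = sum(jaccard(by_language[x], by_language[y]) for x, y in pairs)
        scores[question] = total / len(pairs)
    return scores


# Hypothetical data: concept ids c1..c5 stand in for ELSST concepts.
sample = {
    "Q1": {"de": {"c1", "c2"}, "el": {"c1", "c2"}, "ro": {"c1", "c2"}},
    "Q2": {"de": {"c3", "c4"}, "el": {"c3"}, "ro": {"c3", "c5"}},
}
scores = consistency_by_question(sample)
```

A low score for a question only flags it for inspection. As noted above, the cause may lie in the terms, in the wording of the question in a given language, or in how an indexer read the instructions.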

Project impact

Overall, the SERISS work proved valuable in highlighting issues with ELSST terms and their translations. These issues cover semantic as well as more formal/stylistic aspects of terms. The results have been used to produce Guidelines for the management of ELSST content, which in turn have been used to update ELSST translation guidelines and training, and to inform ongoing thesaurus development and translation work.

Besides improving the translation quality of ELSST terms, the SERISS work will also be of interest to those investigating how to index data, i.e. what to index (whole studies, questions and/or variables) and where in the data lifecycle indexing should be carried out.

Results from the SERISS thesaurus evaluation

Posted in Uncategorized | Leave a comment

New guidelines on ELSST content

Lorna Balkan

Overview

As part of the SERISS project, we have recently produced new guidelines for the management of ELSST content. These incorporate findings from the first results of the project, which used back-translation to evaluate the translation quality of ELSST terms.

The new guidelines are aimed primarily at ELSST translators and content developers, but will also be of interest to end-users. They are divided into six parts, which cover the following topics:

  1. overview and background information on ELSST
  2. the main linguistic elements of ELSST
  3. management structure
  4. construction and maintenance of the thesaurus
  5. the translation process
  6. quality control issues

Part 1 covers the purpose, history, scope and languages of ELSST, as well as versioning, access and copyright issues. Part 2 defines the various elements of ELSST, i.e. concepts, terms, relationships and notes. Part 3 describes the overall management structure, including the various roles and responsibilities, as well as the communication and decision-making processes. Parts 4 and 5 explain how each element of the thesaurus described in Part 2 is constructed and translated. A list of vocabulary resources for the translation process is provided in an appendix. Part 6 discusses quality control issues, and includes a checklist aimed at both source language editors and translators.

The guidelines take account of ISO 25964-1, the latest international standard on thesaurus construction, published by the International Organization for Standardization, as well as the work of other knowledge organisation experts and thesaurus developers. Each part of the guidelines presents best practice for thesauri in general, followed by specific guidelines for ELSST.

Current and future uses

The new guidelines have already been used for training new ELSST translators and for briefing those who are using ELSST to index survey questions as part of ongoing work in the SERISS project. (The SERISS work consists of indexing questions from three cross-national surveys (the European Social Survey (ESS), the Survey of Health, Ageing and Retirement in Europe (SHARE) and the European Values Study (EVS)) in three languages (German, Greek and Romanian) using ELSST terms from the relevant languages. The aim is to establish how well the ELSST terms match the content of the questions and, at the same time, to uncover any translation issues with the survey questions and/or the ELSST terms.)

The guidelines will also be used in the CESSDA-ELSST follow-up project, the Vocabulary Services Multilingual Content Management (VOICE) project, reported in the CESSDA-ELSST project update blog post. They will inform the new policy and procedural documents that will be required, as well as provide input for new training modules. Additional languages are planned for ELSST during the VOICE project (more details to follow). To make translator training more effective, different modalities for delivering it will be investigated.


CESSDA ELSST and the Taxonomy Boot Camp, 17-18 October 2017, Olympia London

The second edition of the Taxonomy Boot Camp took place in London on 17 and 18 October (http://www.taxonomybootcamp.com/London/2017/default.aspx). More than 200 participants attended the conference; 21 presentations were offered, most of them as part of parallel sessions. Here are some notes on the sessions I attended.

The opening keynote “Kick the beehive: new approaches to building taxonomies for the real world” by Madi Weland Solomon guided us into the real world of taxonomies and ontologies, text mining, auto-classification, graph databases and advanced semantic technologies. Questions about those important issues and developments were introduced by different musical notes and led to presents and gifts: a dynamic and entertaining session indeed.

Dave Clarke, CEO of Synaptica, presented a best-practice guide to jump-starting your taxonomy project. Indispensable key concepts are the formal data models ISO 25964-1 (2011) and SKOS. Governance and management (create, edit, publish), together with the adopt-adapt-create approach, are important cornerstones, as are principles such as semantic integrity, interoperability and collaboration.

The session “Making sense of unstructured and large datasets” focused on search, navigation and discovery, based on the notion of the “saviour machine”, with regard to using or not using text analytics and machine learning (Ahren Lehnert, Synaptica). Martin Kaltenböck (Semantic Web Company) informed us in detail about the UN Climate Technology Centre and Network (CTCN), more specifically about its knowledge graph implementation. It is useful to know that Semantic Web Company are the organisers of the SEMANTiCS conference (https://2017.semantics.cc/).

Interdisciplinarity, disambiguation, concept clarification, vocabulary alignment and the essential communication between taxonomists, librarians and field specialists formed the interesting issues discussed by Solveig Sørbø and Heidi Konestabo of the Norwegian Science Library at the University of Oslo. The important message taken from Roger Press (Academic Rights Press) was to pass data through several algorithms to derive the ranking order.

I took part in the session on “Working with large multi-faceted and multi-lingual taxonomies” with a presentation on “An in-depth view behind the scenes: the grammar, semantics and management of thesauri”. (An email to Jeannine.beeken@essex.ac.uk should suffice to receive a copy of the PPT.)

The other slot was given to Beate Früh, Annette Weilandt and Silvia Giacomotti who talked about “Translation of taxonomies: challenges, methods and synergies”, in other words, about the three Ts: Taxonomy, Terminology and Translation.  Their work for Suva (Swiss National Accident Insurance Fund) involves four languages: German, French, Italian and English.

A lecture by Tom Reamy (KAPS Group) on the ever so important problem of fake news and how to try and unmask it by using text analytics ended the first day.

The second day was opened by Joseph Busch (Taxonomy Strategies & Semantic Staffing) with a very interesting keynote on “AI vs. automation: the newest technologies for automatic tagging”. Busch addressed the issues of complete and consistent metadata (a computer performs better than a human indexer: 80% versus 70%). Cloud computing, in other words “buying space”, should be encouraged, as should automated tagging. The latter deals with entity extraction (NER), keyword extraction, categorizers trained by examples, and summarization (identifying key sentences). The open source infrastructure/software GATE/ANNIE (services.gate.ac.uk/annie) was recommended for processing human language. AI, on the other hand, is characterized by trained/statistical categorizers, such as IBM Watson NLP, Intellexer and Lexalytics. These are certainly worth a much closer look and investigation.

In the next session, “Collaborative working for website navigation project”, Emma Maxim (Government Digital Service, GDS, UK) talked about revisiting the common UI, which offered an amalgamation of four or five different taxonomies at the same time (search, navigation, filters, types etc.), and replacing it with a single taxonomy based on themes only at the first level, with the next levels presented in the form of concentric circles. She also guided the audience through a best practice for user research: discovery (information gathering with large groups of users), validation (agile research sprints), beta-testing (set up success criteria before testing!) and dissemination. She recommended using or developing machine learning algorithms to get a testable taxonomy. (More information on GDS developments can be found at https://gds.blog.gov.uk.)

“Semantic models in action” saw a presentation by Julia Barrott and Sukaina Bharwani (Stockholm Environment Institute) on “Visualising a harmonised language for climate services and disaster risk reduction”. Their main focus was on the visualisation of knowledge and knowledge discovery. Each term/concept/variable is identified by a tag/identity card/profile containing definitions, glossaries etc., which is clickable in tree structures.

Sabrina Wilske (Kantar TNS) discussed taxonomies for tweets, addressing the implementation of machine assisted algorithms and the linguistic problem of finding identifiers in tweets for abstract concepts (e.g. dehydration).

Veronique Malaisé (Elsevier), in “From vocabulary requirements to a SKOS-XL model at Elsevier”, focussed on RDF triples, comparing OWL, SKOS and SKOS-XL. While OWL defines specific properties for predicates (for example transitivity) in terms of “is-a” relationships, SKOS expands the predicate options with, for example, meronymy. Using SHACL for automatic checks is a plus, but there are no possibilities for inference, or for adding a predicate to the labels themselves (for example, for acronyms). In other words, all information is attached to the concepts. SKOS-XL, however, also allows information to be added to the labels and inter-label relationships to be identified.
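To make the difference between SKOS and SKOS-XL labels more concrete, here is a small sketch that models RDF triples as plain Python tuples. The two namespace URIs are the published SKOS and SKOS-XL ones; the example.org namespace, the concept, its labels and the acronymOf predicate are invented for illustration and are not taken from the talk. In a real SKOS-XL model, such a predicate would typically be declared as a sub-property of skosxl:labelRelation.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Plain SKOS: a label is a literal, so nothing more can be said about it.
skos_triples = [
    (EX + "c1", SKOS + "prefLabel", '"European Union"@en'),
    (EX + "c1", SKOS + "altLabel", '"EU"@en'),
]

# SKOS-XL: a label is a resource with its own URI, so it can carry
# statements of its own, including links to other labels.
skosxl_triples = [
    (EX + "c1", SKOSXL + "prefLabel", EX + "label1"),
    (EX + "label1", SKOSXL + "literalForm", '"European Union"@en'),
    (EX + "c1", SKOSXL + "altLabel", EX + "label2"),
    (EX + "label2", SKOSXL + "literalForm", '"EU"@en'),
    # Hypothetical inter-label relationship (assumed predicate name).
    (EX + "label2", EX + "acronymOf", EX + "label1"),
]


def label_resources(triples):
    """Return nodes that occur both as an object and as a subject,
    i.e. labels that are themselves described by further triples."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return subjects & objects
```

With the plain SKOS triples, label_resources returns an empty set, because literals cannot be the subject of a triple. With the SKOS-XL triples, it returns the two label resources.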

Cathy Dolbear (OUP) presented on “Automating the categorisation of academic content to a subject taxonomy” at the session “Taxonomy evaluation and maintenance”. Her taxonomy contains 1,500 nodes (SKOS-XML) in 6 layers, is based on and used for manual indexing, and allows polyhierarchies. The next step is to classify at chapter and article level, improving the discoverability of the topic/about field. The corpus contains 396 journals, 30,624 books and 7.5 million items at chapter/article level. Concerning retrieval, Dolbear’s advice was to place precision above recall. Tools such as PoolParty, Protégé and Scikit-learn are used.

The plenary session on “Language is rarely neutral: why the ethics of taxonomies matter”, led by Stella Dextre Clarke, brought some interesting issues to the surface with respect to fake news and gender neutrality/expansion, in other words, the importance of how we classify facts, things and people.

Posted in Conferences/workshops | Leave a comment

CESSDA-ELSST project update

The Thesaurus Team

The CESSDA-ELSST project, which has funded the development of the ELSST and HASSET thesauri over the last five years, officially ended on 30 September 2017.

The project, funded by the UK’s Economic and Social Research Council (ESRC), had three main aims:

  • merging and improving the (internal) management interface of both thesauri
  • updating and improving the (external) user interface of both thesauri
  • reviewing and updating the thesauri’s content

All three goals have been achieved.

New thesaurus management system

The new thesaurus management system, which brought the two thesauri onto the same development platform, was launched in January 2015. Since then, work has focused on streamlining the workflow for concept management and improving reporting functions.

Two new separate, but visually similar, user interfaces were also launched in January 2015. They enable the user to access HASSET and ELSST at https://hasset.ukdataservice.ac.uk/ and https://elsst.ukdataservice.ac.uk/ respectively. An innovative feature of both is the interactive visual graph view, which provides an alternative to the standard form view for navigating thesaurus terms and relationships.

Both the thesaurus management system and the user interfaces have undergone two stages of development, in response to user feedback, and the latest versions are due for release later this year.

Updated content

Work on developing the content of HASSET and ELSST has progressed in full collaboration with ELSST translators. Communication is conducted via a dedicated email list and quarterly online meetings. The input from all who have contributed has greatly enhanced the status of ELSST as a multilingual resource and the level of commitment shown by everyone is testament to the value placed on ELSST by both the individuals concerned and their institutions.

We are pleased to report that during the lifetime of the project, four new languages have been added to ELSST: Czech, Lithuanian and Romanian (in 2015) and Slovenian (in 2017). In addition, two new archives, the Luxembourg Institute of Socio-Economic Research (LISER), and the Austrian Social Science Data Archive (AuSSDA) have joined us to contribute to the translation of French and German respectively.

The project also achieved its goal of an annual release of ELSST through two releases (in September 2016 and September 2017).

Future funding

We are happy to announce that further funding has been secured for ELSST from January to December 2018 via the CESSDA Vocabulary Services Multilingual Content Management (VOICE) work plan. This work will involve the continuation of content development and extension to more languages (details to be confirmed). The UK Data Service will continue to lead the work, with FSD and GESIS as lead partners. The UK Data Service and many ELSST partners are also involved in other related CESSDA projects, including the Euro Question Bank (EQB), the CESSDA Metadata Management (CMM) project, and the Controlled Vocabularies (CV) Manager project. Further information about all these projects can be found at https://www.cessda.eu/Projects/Work-Plans

In the meantime, we would like to thank all those who helped make CESSDA-ELSST a success, and we look forward to working with you all in the next phase of ELSST’s development.


Translating ELSST into Slovenian

Sonja Bezjak and Irena Bolko

The challenge

The Slovenian Social Science Data Archives (ADP) are keen supporters of CESSDA and the cross-national harmonization of archives. We firmly believe that translating the ELSST thesaurus is an important step towards achieving this goal. However, as a small team, we lack the necessary resources to fully engage in the translation project.

In the past few years, ADP has been liaising with the Slovenian Common Language Resources and Technology Infrastructure (CLARIN), sharing our knowledge and experience. For ADP, using digital language technologies offered a promising way to reduce the time and effort of the translation process.

Automatic translation

Translating ELSST into Slovenian was carried out as a joint project consisting of two steps: automatic translation undertaken by a team of language technology experts, followed by manual editing of the translation by ADP with the support of terminology experts from the relevant subject domains.

In the first phase, the expert team selected and prepared several translation sources. The linguistic expert chose more general translation resources while ADP proposed subject dictionaries. Before translating, all terms and translations from the various translation sources were converted to upper case (as required by ELSST) and all plural-form ELSST source language terms were changed to singular to match the form in the translation sources.

Next, each whole English term was looked up in every translation source, and the results collated. Often the same translation was found in multiple sources. If no translations of the whole term were found, translations were constructed. English terms were subdivided and each subpart was translated independently. The translations of the subparts were then combined to produce a final Slovenian translation of the source term.
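The lookup-then-construct procedure described above can be sketched roughly as follows. This is an illustrative reconstruction, not the algorithm the language technology team actually used: the singularisation rule is deliberately naive, splitting terms on whitespace is an assumption, and the dictionaries contain placeholder strings rather than real Slovenian translations.

```python
def singularise(term):
    """Very naive English singularisation of the last word (illustration only)."""
    words = term.split()
    if not words:
        return term
    last = words[-1]
    if last.endswith("IES"):
        last = last[:-3] + "Y"
    elif last.endswith("S") and not last.endswith("SS"):
        last = last[:-1]
    return " ".join(words[:-1] + [last])


def normalise(term):
    """Upper-case the term and put it in the singular, as the sources expect."""
    return singularise(term.strip().upper())


def translate(term, sources):
    """Return (candidates, method) for one English term.

    sources is a list of dicts mapping normalised English terms to lists
    of translations. Whole-term matches from all sources are collated;
    if there are none, a translation is constructed word by word.
    """
    key = normalise(term)
    found = []
    for source in sources:
        for candidate in source.get(key, []):
            if candidate not in found:  # same translation in several sources
                found.append(candidate)
    if found:
        return found, "whole-term"

    parts = []
    for word in key.split():
        hits = [c for source in sources for c in source.get(word, [])]
        if not hits:
            return [], "untranslated"
        parts.append(hits[0])
    return [" ".join(parts)], "constructed"


# Placeholder dictionaries: the T-... strings stand in for real translations.
general = {"CHILD": ["T-CHILD"], "WORKER": ["T-WORKER"]}
subject = {"TRADE UNION": ["T-TRADE-UNION"], "CHILD": ["T-CHILD"]}
sources = [general, subject]
```

Simply joining the translated subparts ignores word order and inflection in the target language, which is one reason a constructed translation still needs the manual check described in the next section.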

Manual editing

In the second phase, the ADP team performed a manual check of the automatic translations, verifying and editing them if needed. This phase was subdivided into five tasks:

  • Choosing the best option among the translations produced from the various sources (as a result of the automatic translation)
  • Checking and highlighting the terms with potentially problematic translations (e.g. no appropriate translation, multiple options)
  • Checking the translations where issues were detected and seeking advice from subject experts
  • Consulting the linguistic expert on Slovenian grammar rules
  • Confirming the final list of translations

This was the first time that ELSST translation has been undertaken using semi-automatic translation. We believe that this allowed us not only to produce appropriate Slovenian translations but also to reduce our workload.

Further information

A more thorough explanation of the process described above and the algorithms used is beyond the scope of this blog post. However, should you be interested in reading more, we are happy to hear from you and provide you with additional information. You can contact us either by replying to this blog post or by sending an email to arhiv.podatkov@fdv.uni-lj.si.

See also CESSDA ELSST New Release 21 September 2017

CESSDA ELSST New Release 21 September 2017

Jeannine Beeken

We are pleased to announce a new release of The European Language Social Science Thesaurus (ELSST). ELSST is now available in 13 languages: Czech, Danish, English, Finnish, French, German, Greek, Lithuanian, Norwegian, Romanian, Slovenian, Spanish and Swedish. The thesaurus offers information about almost 40,000 preferred terms and contains more than 33,000 non-preferred terms.

Since the previous release in 2016, a considerable number of changes and improvements have been made across all languages. Details on the types of changes can be found at Changes to ELSST. The average percentage of translated preferred terms is 99%, with the majority of languages having 100% coverage.

We are very proud to announce that Slovenian has been added as our 13th language. The Slovenian translations of all preferred terms have been provided by our colleagues at the Slovenian Social Science Data Archives (ADP). See also https://elsst.wordpress.com/2017/09/22/translating-elsst-into-slovenian/

Finally, we are pleased to announce that our colleagues at the Luxembourg Institute of Socio-Economic Research (LISER) have joined us to contribute to the French translations; our colleagues at the Austrian Social Science Data Archive (AuSSDA) have joined us to contribute to the German translation.

The current version of ELSST was released on 21 September 2017. The previous version dates from 6 September 2016.

Posted in Uncategorized | 1 Comment

Notes from symposium on language technology and translation

Lorna Balkan

Translation tools and technologies have gained increasing importance in the translation sector over the years, but until now have been little applied to the specific field of survey translation. To rectify this, the SERISS project held a symposium on “synergies between survey translation and developments in language and translation sciences” at University Pompeu Fabra (UPF) in Barcelona on 1-2 June 2017. The meeting was attended by delegates from the three main surveys involved in the SERISS project (the European Social Survey (ESS), Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the European Values Study (EVS)), as well as by representatives of translation technology companies (cApStAn, Kantar Public and CentERdata) and by a leading academic in automated translation, Professor Toni Badia of UPF.

I was invited to report on my recent work on evaluating ELSST translation quality (see First results of SERISS project) and to consider prospects for automating the translation and evaluation of thesaurus terms.

Two general recommendations emerged from the symposium that are relevant to the translation of ELSST, as well as to survey translation.

First, it is important to analyse the whole life cycle of a product (not just the translation process) and to understand all the steps involved. Action taken prior to the translation phase has an impact on the translation process. We know this in ELSST, which is why we try, when creating new concepts, to choose source language labels and scope notes that will not present problems to the translators.

Second, it is critical to identify which steps a machine can perform better than humans. In the case of survey translation, this includes recognising questions that have not changed since the last wave of a survey, and which thus do not need to be retranslated. It also includes consistency checking in the quality assessment phase. Consistency checking would also be useful in ELSST, to make sure that source language terms that appear within other terms are given the same translation in the corresponding target language terms, where appropriate.
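The kind of consistency check suggested for ELSST could look something like the sketch below. It flags source terms that occur inside a longer source term while their translation does not occur inside the longer term's translation. The function names are hypothetical, and the English/French pairs are invented examples rather than actual ELSST entries. Exact word matching is also a simplification: in inflected languages a legitimate translation may use a different word form, hence the "where appropriate" above.

```python
def contains_words(phrase, sub):
    """True if the words of sub occur as a contiguous run inside phrase."""
    words, sub_words = phrase.split(), sub.split()
    n = len(sub_words)
    return any(words[i:i + n] == sub_words for i in range(len(words) - n + 1))


def find_inconsistencies(pairs):
    """pairs maps source term -> target term.

    Returns sorted (embedded term, longer term) tuples where the embedded
    term's translation is missing from the longer term's translation.
    These are candidates for human review, not confirmed errors.
    """
    issues = []
    for short_src, short_tgt in pairs.items():
        for long_src, long_tgt in pairs.items():
            if short_src == long_src:
                continue
            if contains_words(long_src, short_src) and not contains_words(long_tgt, short_tgt):
                issues.append((short_src, long_src))
    return sorted(issues)


# Invented example pairs (not taken from ELSST).
pairs = {
    "EMPLOYMENT": "EMPLOI",
    "YOUTH EMPLOYMENT": "EMPLOI DES JEUNES",
    "EMPLOYMENT POLICY": "POLITIQUE DU TRAVAIL",
}
flagged = find_inconsistencies(pairs)
```

Here only the pair EMPLOYMENT / EMPLOYMENT POLICY is flagged, because the translation of the longer term does not reuse the translation of the embedded one.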

Delegates agreed that translation memories would be helpful to survey translation, since they store previously translated text which can be reused. The SERISS project is currently using CentERdata’s Translation Management Tool (TMT) to manage the translation of its questionnaires, and the plan is to integrate it with a translation memory in the near future. The TMT is not particularly relevant to ELSST right now, but we shall see how it develops in the course of the project.
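For readers unfamiliar with translation memories, the basic idea can be sketched in a few lines. This is a toy illustration, not a description of the TMT or of any commercial tool: the class and method names are made up, the similarity threshold of 0.85 is an arbitrary assumption, and the fuzzy matching relies on Python's standard difflib module. The survey questions and target strings are invented.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory: stores source segments with their translations."""

    def __init__(self):
        self._segments = {}

    @staticmethod
    def _key(text):
        # Ignore case and repeated whitespace when comparing segments.
        return " ".join(text.lower().split())

    def add(self, source, target):
        self._segments[self._key(source)] = target

    def lookup(self, source, threshold=0.85):
        """Return (target, score).

        An unchanged segment scores 1.0 and can be reused as it is. A
        similar segment scoring at least threshold is offered as a draft
        to edit. Otherwise (None, 0.0) is returned.
        """
        key = self._key(source)
        if key in self._segments:
            return self._segments[key], 1.0
        best_target, best_score = None, 0.0
        for stored, target in self._segments.items():
            score = SequenceMatcher(None, key, stored).ratio()
            if score > best_score:
                best_target, best_score = target, score
        if best_score >= threshold:
            return best_target, best_score
        return None, 0.0


# Invented example segment; the target is a placeholder, not a real translation.
tm = TranslationMemory()
tm.add("How satisfied are you with your life?", "<stored translation>")
unchanged = tm.lookup("how satisfied are you  with your life?")
similar = tm.lookup("How satisfied are you with your job?")
unrelated = tm.lookup("What is your year of birth?")
```

The exact-match branch corresponds to recognising questions that have not changed since the last wave of a survey, so that they do not need to be retranslated.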

Toni Badia proposed that machine translation was mature enough to be able to offer a first draft of survey questions, if human resources were not available. He mentioned that a good starting point was phrase-based statistical machine translation systems, such as Moses, but noted that the paradigm is shifting towards neural machine translation which promises better quality. This is certainly something we could also consider for ELSST.

Another recommendation of the symposium was for translators of the different surveys to collaborate with each other. Each has a list of well-known translation problems. These problems would interest ELSST developers also, so we shall ask to be included in any future collaboration.
