CESSDA-ELSST project update

The Thesaurus Team

The CESSDA-ELSST project, which has funded the development of the ELSST and HASSET thesauri over the last five years, officially ended on 30 September 2017.

The project, funded by the UK’s Economic and Social Research Council (ESRC), had three main aims:

  • merging and improving the (internal) management interface of both thesauri
  • updating and improving the (external) user interface of both thesauri
  • reviewing and updating the thesauri’s content

All three goals have been achieved.

New thesaurus management system

The new thesaurus management system, which brought the two thesauri onto the same development platform, was launched in January 2015. Since then, work has focused on streamlining the workflow for concept management and improving reporting functions.

Two new separate, but visually similar, user interfaces were also launched in January 2015. They enable the user to access HASSET and ELSST at https://hasset.ukdataservice.ac.uk/ and https://elsst.ukdataservice.ac.uk/ respectively. An innovative feature of both is the interactive visual graph view which provides an alternative way of navigating thesaurus terms and relationships rather than the standard form view.

Both the thesaurus management system and the user interfaces have undergone two stages of development, in response to user feedback, and the latest versions are due for release later this year.

Updated content

Work on developing the content of HASSET and ELSST has progressed in full collaboration with ELSST translators. Communication is conducted via a dedicated email list and quarterly online meetings. The input from all who have contributed has greatly enhanced the status of ELSST as a multilingual resource and the level of commitment shown by everyone is testament to the value placed on ELSST by both the individuals concerned and their institutions.

We are pleased to report that during the lifetime of the project, four new languages have been added to ELSST: Czech, Lithuanian and Romanian (in 2015) and Slovenian (in 2017). In addition, two new archives, the Luxembourg Institute of Socio-Economic Research (LISER) and the Austrian Social Science Data Archive (AuSSDA), have joined us to contribute to the French and German translations respectively.

The project also achieved its goal of an annual release of ELSST through two releases (in September 2016 and September 2017).

Future funding

We are happy to announce that further funding has been secured for ELSST from January to December 2018 via the CESSDA Vocabulary Services Multilingual Content Maintenance work plan. This work will involve the continuation of content development and extension to more languages (details to be confirmed). The UK Data Service will continue to lead the work, with FSD and GESIS as lead partners. The UK Data Service and many ELSST partners are also involved in other related CESSDA projects, including the Euro Question Bank (EQB), the CESSDA Metadata Management (CMM) project, and the Controlled Vocabularies (CV) Manager project. Further information about all these projects can be found at https://www.cessda.eu/Projects/Work-Plans

In the meantime, we would like to thank all those who helped make CESSDA-ELSST a success, and we look forward to working with you all in the next phase of ELSST’s development.

Posted in Uncategorized | Leave a comment

Translating ELSST into Slovenian

Sonja Bezjak and Irena Bolko

The challenge

The Slovenian Social Science Data Archives (ADP) are keen supporters of CESSDA and the cross-national harmonization of archives. We firmly believe that translating the ELSST thesaurus is an important step towards achieving this goal. However, as a small team, we lack the necessary resources to fully engage in the translation project.

In the past few years ADP has been liaising with the Slovenian Common Language Resources and Technology Infrastructure (CLARIN), sharing our knowledge and experience. For ADP, using digital language technologies offered a promising way to reduce the time and effort of the translation process.

Automatic translation

Translating ELSST into Slovenian was carried out as a joint project consisting of two steps: automatic translation undertaken by a team of language technology experts, followed by manual editing of the translation by ADP with the support of terminology experts from the relevant subject domains.

In the first phase, the expert team selected and prepared several translation sources. The linguistic expert chose more general translation resources while ADP proposed subject dictionaries. Before translating, all terms and translations from the various translation sources were converted to upper case (as required by ELSST) and all plural-form ELSST source language terms were changed to singular to match the form in the translation sources.

Next, each whole English term was looked up in every translation source, and the results collated. Often the same translation was found in multiple sources. If no translations of the whole term were found, translations were constructed. English terms were subdivided and each subpart was translated independently. The translations of the subparts were then combined to produce a final Slovenian translation of the source term.
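The lookup-then-construct procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm the project team used: the post does not say how terms were singularised, subdivided or recombined, so the naive plural stripping, the word-by-word split and the simple joining of subparts below are all our own simplifications. The function names, the `SOURCES` structure and the "SL_..." strings are hypothetical placeholders rather than real dictionary entries.

```python
from collections import Counter

# Hypothetical translation sources: each maps a normalised English term to a
# list of candidate translations. The "SL_..." values are placeholders only.
SOURCES = [
    {"TRADE UNION": ["SL_TRADE_UNION"], "YOUTH": ["SL_YOUTH"]},
    {"TRADE UNION": ["SL_TRADE_UNION", "SL_UNION_ALT"], "POLICY": ["SL_POLICY"]},
]


def normalise(term):
    """Upper-case a term and strip a plural 'S' from its last word (very naive)."""
    words = term.upper().split()
    if words and words[-1].endswith("S") and not words[-1].endswith("SS"):
        words[-1] = words[-1][:-1]
    return " ".join(words)


def collate(key, sources):
    """Count how many times each candidate translation is proposed for key."""
    hits = Counter()
    for source in sources:
        hits.update(source.get(key, []))
    return hits


def translate(term, sources):
    """Return candidate translations, best supported first, or [] if none."""
    key = normalise(term)

    # Step 1: look the whole term up in every source and collate the results.
    hits = collate(key, sources)
    if hits:
        return [candidate for candidate, _ in hits.most_common()]

    # Step 2: no whole-term match, so translate each subpart independently
    # and combine the results (assumed here: split on spaces, join in order).
    parts = []
    for word in key.split():
        word_hits = collate(word, sources)
        if not word_hits:
            return []  # a subpart could not be translated at all
        parts.append(word_hits.most_common(1)[0][0])
    return [" ".join(parts)] if parts else []
```

Calling `translate("trade unions", SOURCES)` returns the collated whole-term candidates, while `translate("Youth policy", SOURCES)` falls back to a constructed translation. Constructed translations in particular will often need correcting for word order and grammar, which is exactly what the manual editing phase is for.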

Manual editing

In the second phase, the ADP team performed a manual check of the automatic translations, verifying and editing them if needed. This phase was subdivided into five tasks:

  • Choosing the best option among the translations produced from the various sources (as a result of the automatic translation)
  • Checking and highlighting the terms with potentially problematic translations (e.g. no appropriate translation, multiple options)
  • Checking the translations where issues were detected and seeking advice from subject experts
  • Consulting the linguistic expert on Slovenian grammar rules
  • Confirming the final list of translations

This is the first time that ELSST translation has been undertaken using semi-automatic translation. We believe that this allowed us not only to produce appropriate Slovenian translations but also to reduce our workload.

Further information

A more thorough explanation of the process described above and the algorithms used is beyond the scope of this blog post. However, should you be interested in reading more, we are happy to hear from you and provide you with additional information. You can contact us either by replying to this blog post or by sending an email to arhiv.podatkov@fdv.uni-lj.si.

See also CESSDA ELSST New Release 21 September 2017

 

Posted in Uncategorized | Leave a comment

CESSDA ELSST New Release 21 September 2017

Jeannine Beeken

We are pleased to announce a new release of the European Language Social Science Thesaurus (ELSST). ELSST is now available in 13 languages: Czech, Danish, English, Finnish, French, German, Greek, Lithuanian, Norwegian, Romanian, Slovenian, Spanish and Swedish. The thesaurus offers information about almost 40,000 preferred terms and contains more than 33,000 non-preferred terms.

Since the previous release in 2016, a considerable number of changes and improvements have been made across all languages. Details on the types of changes can be found at Changes to ELSST. The average percentage of translated preferred terms is 99%, with the majority of languages having 100% coverage.

We are very proud to announce that Slovenian has been added as our 13th language. The Slovenian translations of all preferred terms have been provided by our colleagues at the Slovenian Social Science Data Archives (ADP). See also https://elsst.wordpress.com/2017/09/22/translating-elsst-into-slovenian/

Finally, we are pleased to announce that our colleagues at the Luxembourg Institute of Socio-Economic Research (LISER) have joined us to contribute to the French translations; our colleagues at the Austrian Social Science Data Archive (AuSSDA) have joined us to contribute to the German translation.

The current version of ELSST was released on 21 September 2017. The previous version dates from 6 September 2016.

Posted in Uncategorized | 1 Comment

Notes from symposium on language technology and translation

Lorna Balkan

Translation tools and technologies have gained increasing importance in the translation sector over the years, but until now have been little applied to the specific field of survey translation. To rectify this, the SERISS project held a symposium on “synergies between survey translation and developments in language and translation sciences” at University Pompeu Fabra (UPF) in Barcelona on 1-2 June 2017. The meeting was attended by delegates from the three main surveys involved in the SERISS project (the European Social Survey (ESS), Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the European Values Study (EVS)), as well as by representatives of translation technology companies (cApStAn, Kantar Public and CentERdata) and by a leading academic in automated translation, Professor Toni Badia of UPF.

I was invited to report on my recent work on evaluating ELSST translation quality (see First results of SERISS project) and to consider prospects for automating the translation and evaluation of thesaurus terms.

Two general recommendations emerged from the symposium that are relevant to the translation of ELSST, as well as to survey translation.

First, it is important to analyse the whole life cycle of a product (not just the translation process) and to understand all the steps involved. Action taken prior to the translation phase has an impact on the translation process. We know this in ELSST, which is why we try, when creating new concepts, to choose source language labels and scope notes that will not present problems to the translators.

Second, it is critical to identify which steps a machine can perform better than humans. In the case of survey translation, this includes recognising questions that have not changed since the last wave of a survey, and which thus do not need to be retranslated. It also includes consistency checking in the quality assessment phase. Consistency checking would also be useful in ELSST, to make sure that source language terms that appear within other terms are given the same translation in the corresponding target language terms, where appropriate.
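To make the ELSST consistency check concrete, here is a minimal Python sketch. It is an illustration only, not an existing ELSST tool: the function names are hypothetical, the "T_..." target strings are placeholders, and the plain substring test on the target side is a deliberate simplification that will raise false alarms in inflected languages. Its output is therefore a list of pairs for a human to review, in line with the "where appropriate" caveat above.

```python
def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs contiguously inside `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def find_inconsistencies(pairs):
    """Flag terms whose translation does not reuse an embedded term's translation.

    `pairs` maps each source-language term to its target-language term.
    Returns (inner_source, outer_source, outer_target) tuples for review.
    """
    issues = []
    for inner_src, inner_tgt in pairs.items():
        inner_words = inner_src.split()
        if not inner_words:
            continue
        for outer_src, outer_tgt in pairs.items():
            if outer_src == inner_src:
                continue
            embedded = contains_phrase(outer_src.split(), inner_words)
            # Naive check: the inner translation should appear verbatim.
            if embedded and inner_tgt not in outer_tgt:
                issues.append((inner_src, outer_src, outer_tgt))
    return issues


# Invented example data; the target strings are placeholders.
PAIRS = {
    "UNEMPLOYMENT": "T_UNEMP",
    "YOUTH UNEMPLOYMENT": "T_YOUTH T_UNEMP",
    "UNEMPLOYMENT BENEFITS": "T_JOBLESS T_BENEFITS",
}
```

With this data, `find_inconsistencies(PAIRS)` flags only UNEMPLOYMENT BENEFITS, because its translation does not contain the translation given to UNEMPLOYMENT on its own.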

Delegates agreed that translation memories would be helpful to survey translation, since they store previously translated text which can be reused. The SERISS project is currently using CentERdata’s Translation Management Tool (TMT) for managing translation of its questionnaires. The TMT is not particularly relevant to ELSST right now, but we shall see how it develops in the course of the project. The plan is to integrate it with a translation memory in the near future.
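For readers unfamiliar with translation memories, the idea can be sketched as a store of previously translated segments with exact and fuzzy lookup. The class below is a toy illustration and has nothing to do with how the TMT or any commercial product is implemented: the class name, the 0.8 similarity threshold and the placeholder translation are all assumptions. It uses Python's standard `difflib` module for the similarity score.

```python
import difflib


class TranslationMemory:
    """Toy translation memory: stores source segments and their translations."""

    def __init__(self):
        self._segments = {}

    def add(self, source, target):
        """Remember that `source` was translated as `target`."""
        self._segments[source] = target

    def lookup(self, source, threshold=0.8):
        """Return (translation, similarity) for the best match, or None.

        An unchanged segment is an exact match with similarity 1.0; otherwise
        the most similar stored segment is reused if it reaches `threshold`.
        """
        if source in self._segments:
            return self._segments[source], 1.0
        best_target, best_score = None, 0.0
        for stored, target in self._segments.items():
            score = difflib.SequenceMatcher(None, source, stored).ratio()
            if score > best_score:
                best_target, best_score = target, score
        if best_target is not None and best_score >= threshold:
            return best_target, best_score
        return None


tm = TranslationMemory()
tm.add("How old are you?", "T_AGE_QUESTION")  # placeholder translation
```

An exact hit corresponds to the case mentioned earlier of a question that has not changed since the last wave and so needs no retranslation; a fuzzy hit is only a suggestion for the translator to adapt.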

Toni Badia proposed that machine translation was mature enough to be able to offer a first draft of survey questions, if human resources were not available. He mentioned that a good starting point was phrase-based statistical machine translation systems, such as Moses, but noted that the paradigm is shifting towards neural machine translation which promises better quality. This is certainly something we could also consider for ELSST.

Another recommendation of the symposium was for translators of the different surveys to collaborate with each other. Each has a list of well-known translation problems. These problems would interest ELSST developers also, so we shall ask to be included in any future collaboration.

Posted in Uncategorized | Leave a comment

Observatory for Knowledge Organisation Systems workshop, Malta, 2017

Suzanne Barbalet

The Observatory for Knowledge Organisation Systems workshop, an event of KNOWeSCAPE – Analyzing the Dynamics of Information and Knowledge Landscapes, was held in Malta from 1 to 3 February 2017. KNOWeSCAPE is funded under the European Cooperation in Science and Technology (COST) framework.

Based on a European intergovernmental framework for cooperation in science and technology, COST supports trans-national cooperation across all fields in science and technology, including social sciences and humanities, through pan-European networking of nationally funded research activities. Contributors to this event were indeed representative of such a cross-section of Knowledge Organisation System (KOS) users and developers.

Philipp Mayr from the GESIS department of Knowledge Technologies for the Social Sciences (WTS) reported on case studies using two KOS applications for query expansion on the two platforms that his team manages, Sowiport and the Social Science Open Access Repository (SSOAR). These platforms cover bibliographic information and full text in the social sciences. When the two KOS applications were mapped to one another by subject specialists, it was found that the use of more than one KOS tool improved the user experience.

A presentation by Kalpana Shankar reported on a study of two data archives, the UK Data Archive (University of Essex) and the Inter-university Consortium for Political and Social Research (ICPSR, University of Michigan), with the aim of raising larger questions about data sustainability. Still a work in progress, it promises to be an important piece of research on the history of social science data archives and their impact on the social sciences in the latter part of the twentieth century.

An interesting talk by Paul Groth (Elsevier, Netherlands) considered how the process of constructing a KOS might be changing with the incorporation of software agents and non-professional contributors, and suggested a role for a KOS observatory to engage with these issues.

On the second day the workshop concluded with an interesting question and answer session on the collaborative process used in the development of Wikipedia content. A strong thread in the debate that ensued was the role a national Wikipedia may play in the preservation of cultural heritage.

In the first presentation of the cataloguing session, Jan Kozlowski challenged us to consider the importance of providing metadata for modern European manuscripts. Two further presentations were on the Universal Decimal Classification (UDC) scheme. In her presentation ‘Universal Knowledge Classifications: From Linking Information to Linked Data’, Aida Slavic stressed the importance of classification to ensure full retrieval for those instances where we cannot afford not to find everything. This argument was similar to that made by Patrick Lambe at the ISKO UK conference 2015 and reported on in ELSST Development and News.

A presentation by Peter Hook from Wayne State University entitled Visualizing Knowledge Organization Systems provided some interesting insights that have application for a project I have been piloting using UDC to manage the content of subject categories or topics.

I am particularly grateful to Andrea Scharnhorst of Data Archiving and Networked Services (DANS), Netherlands, and Aida Slavic, editor-in-chief of the UDC Consortium, for the invitation to attend.

Posted in Conferences/workshops | Leave a comment

First results of SERISS project

Lorna Balkan

Synergies for Europe’s Research Infrastructures in the Social Sciences (SERISS) is a four-year project (2015-2019) funded by the European Commission as part of its Horizon 2020 programme. It aims to foster collaboration and develop shared standards between the three leading European research infrastructures in the social sciences – the European Social Survey (ESS ERIC), the Survey of Health, Ageing and Retirement in Europe (SHARE ERIC), and the Consortium of European Social Science Data Archives (CESSDA AS) – and organisations representing the Generations and Gender Programme (GGP), European Values Study (EVS) and the WageIndicator Survey. Work focuses on three key areas: addressing key challenges for cross-national data collection, breaking down barriers between social science infrastructures, and embracing the future of the social sciences.

The first results of the project are now available online. These include D3.9 ‘Report on findings from re-translation of ELSST terms and their use in the CESSDA Portal’, which reports on the work done by the UK Data Service. The deliverable describes two methods that were used to assess the translation quality of ELSST terms, and is in two parts.

Back-translation

The first part describes the evaluation of a subset of ELSST French and German terms (1000 from each language) using the re-translation (or more precisely, the back-translation) method. The French and German terms were back-translated into the source language (English) and differences between the back-translations and the original source language terms were then analysed. This resulted in a classification of error types, and a number of recommendations. The deliverable shows that, while the back-translation method was useful in highlighting some issues with the thesaurus that affect both its ‘semantic adequacy’ (i.e. how adequate terms are from a semantics point of view) and its ‘formal adequacy’ (i.e. the extent to which terms conform to ELSST Translation Guidelines), it has nothing to say about ‘pragmatic adequacy’ (i.e. how acceptable terms are to users), or how the terms would function in an operational setting. Back-translation should, therefore, be seen as one of several complementary evaluation methods.
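The mechanical part of the back-translation method, comparing each back-translation with the original source term, can be sketched as follows. This is an illustration only and not the procedure documented in the deliverable: the function names are hypothetical, the records are invented, the "FR_TERM_n" strings are placeholders for target-language terms, and the comparison simply ignores case and spacing. Classifying the differences into error types remains a manual, analytical task.

```python
def normalise(term):
    """Ignore differences of case and spacing when comparing terms."""
    return " ".join(term.casefold().split())


def compare_back_translations(records):
    """Split records into those that round-trip exactly and those that do not.

    `records` is an iterable of (source_term, target_term, back_translation).
    Records in the 'different' list are candidates for manual analysis.
    """
    report = {"identical": [], "different": []}
    for source, target, back in records:
        same = normalise(source) == normalise(back)
        report["identical" if same else "different"].append((source, target, back))
    return report


# Invented example records; target terms are placeholders.
RECORDS = [
    ("UNEMPLOYMENT", "FR_TERM_1", "unemployment"),
    ("OLDER PEOPLE", "FR_TERM_2", "ELDERLY"),
    ("TRADE UNIONS", "FR_TERM_3", "Trade  Unions"),
]
```

A difference flagged here is not necessarily an error: a back-translation that is a synonym of the source term may be perfectly acceptable, which is why the flagged pairs still have to be analysed by a person.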

ELSST in use

One such complementary evaluation method is to compare the sets of terms that have been used to index the same resources, to see if differences are due to differences in how the terms have been interpreted, indicating unintended ambiguity in either the source or target terms. This approach is explored in the second part of the deliverable which compares the sets of ELSST terms that have been used to index specific cross-national surveys. The original plan was to use the CESSDA portal to find such studies, but this had to be revised since the portal has not been operational for some time. Instead, CESSDA-ELSST partners were asked via a questionnaire how they index a set of cross-national surveys. Many thanks to all who responded. Differences in the sets of terms assigned to each survey were then analysed. Results showed that, due to the paucity of the data and the differences in indexing practices across archives, it was not possible to draw any firm conclusions on the quality of the translation. However, the work highlighted ways in which ELSST could be better exploited within the archives.
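The comparison of term sets can likewise be illustrated with a short Python sketch. It assumes that each archive's indexing of a survey has already been mapped back to the ELSST source-language preferred terms; the archive names and terms are invented, and the Jaccard overlap score is our own choice of summary measure rather than one taken from the deliverable.

```python
def compare_indexing(terms_by_archive):
    """Compare the sets of ELSST terms that archives assigned to one survey.

    `terms_by_archive` maps an archive name to its set of source-language terms.
    Returns the shared terms, the terms unique to each archive, and a Jaccard
    overlap score (shared terms divided by all terms used).
    """
    term_sets = list(terms_by_archive.values())
    if not term_sets:
        raise ValueError("at least one archive is required")
    shared = set.intersection(*term_sets)
    all_terms = set().union(*term_sets)
    unique = {}
    for archive, terms in terms_by_archive.items():
        others = set().union(
            *(t for name, t in terms_by_archive.items() if name != archive)
        )
        unique[archive] = terms - others
    overlap = len(shared) / len(all_terms) if all_terms else 1.0
    return {"shared": shared, "unique": unique, "jaccard": overlap}


# Invented example: two archives indexing the same cross-national survey.
EXAMPLE = {
    "archive_a": {"HEALTH", "AGEING", "EMPLOYMENT"},
    "archive_b": {"HEALTH", "AGEING", "RETIREMENT"},
}
```

Terms unique to one archive are the interesting ones, but as the results above show, they may reflect different indexing practices just as easily as different interpretations of a term, so a low overlap score on its own says little about translation quality.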

Next steps

The results of the evaluation work described above will feed into ongoing work on HASSET and ELSST within the CESSDA-ELSST project, and into the next goals of the SERISS project. SERISS goals in the next phase (to June 2017) include updating the ELSST translation guidelines and producing the next deliverable: D3.10 ‘Best practice document on translation and use of thesaurus terms’. This work will be produced in consultation with CESSDA-ELSST partners.

Related SERISS work

Complementary work within SERISS is looking at how to improve the translation quality of the questionnaires in cross-national surveys. Different approaches to questionnaire translation are being investigated, including how computational linguistic methods could be exploited. A workshop on this last topic is planned in the near future.

Feedback

If you have any comments on this blog, please either add them below, or send them to the UK Data Service Thesaurus Team at thesaurus@ukdataservice.ac.uk

Posted in Uncategorized | Leave a comment

Showcasing taxonomies and their uses

Lorna Balkan

The first Taxonomy Boot Camp London was held at the Olympia Conference Centre, London on 18-19 November. It was a chance for experts and novices alike to find out about the latest developments in taxonomies and their uses. It was also a chance to present first findings on the evaluation work I have been doing on ELSST within the context of the SERISS project – see my presentation on ‘Using back-translation for quality control in a multilingual thesaurus’.

The boot camp attracted an international audience from the academic, charity, and corporate sectors. ‘Taxonomy’ was understood in a general sense to cover thesauri, ontologies, and other less formal knowledge organization systems (KOSs). Presentations covered the design and construction of taxonomies, application areas such as search and corporate information/knowledge management, and software tools.

Keynote speakers were Mike Atherton, Content Strategist at Facebook, UK and Patrick Lambe, a well-known taxonomist from Straits Knowledge. Atherton argued that underpinning a website with a domain model makes it easier to maintain and update. Lambe showed how taxonomies can be used to organize corporate knowledge (including processes and procedures) as well as corporate information (data).

Taxonomies in theory

Heather Hedden, the author of The Accidental Taxonomist, gave an interesting presentation on non-preferred terms. She discussed how, in the term-based thesaurus model represented by ISO 25964-1 (and found in ELSST), non-preferred terms stand in an equivalence relationship to their preferred terms, while in the concept-based Simple Knowledge Organization System (SKOS) model, they are attributes of concepts. Stella Dextre Clarke, co-author of ISO 25964-1, pointed out that both standards were developed in close collaboration and are largely compatible. While ISO 25964-1 is built for thesauri, she noted, SKOS is designed to accommodate other KOSs also.
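The distinction Hedden drew can be pictured with two small data structures. The Python sketch below is a simplified illustration of the two views as described in the talk, not the actual data model of either ISO 25964-1 or SKOS; the class and function names are hypothetical and the example labels are invented rather than taken from ELSST.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """Term-based view: every label, preferred or not, is an entry of its own."""
    label: str
    # Set on a non-preferred term: points to the preferred term to USE instead.
    use: Optional["Term"] = field(default=None, repr=False, compare=False)
    # Set on a preferred term: the non-preferred terms it is Used For (UF).
    used_for: List["Term"] = field(default_factory=list)


@dataclass
class Concept:
    """Concept-based view: alternative labels are attributes of the concept."""
    pref_label: str
    alt_labels: List[str] = field(default_factory=list)


def link(preferred, non_preferred):
    """Record the equivalence relationship between two terms (USE / UF)."""
    non_preferred.use = preferred
    preferred.used_for.append(non_preferred)


def to_concept(preferred):
    """Convert a preferred term and its equivalents into a single concept."""
    return Concept(preferred.label, [t.label for t in preferred.used_for])


# Invented example labels.
unemployment = Term("UNEMPLOYMENT")
joblessness = Term("JOBLESSNESS")
link(unemployment, joblessness)
```

That a function as short as `to_concept` can convert one view into the other in this toy example fits Dextre Clarke's point that the two standards are largely compatible.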

Taxonomies in practice

A practical session entitled ‘working with multidisciplinary teams – taxonomy tales from the trenches’ was led by experienced taxonomists and offered practical tips for how to negotiate with other stakeholders (project managers, IT specialists, etc.) in the construction of a taxonomy. Of particular relevance to ELSST, in the context of the evaluation work being undertaken in the SERISS project, was Lambe’s warning not to present your taxonomy to an ‘expert’ and ask them what they think of it. Experts should, he argued, be consulted about specific questions only.

Tools and technology

Many different software tools for constructing and managing taxonomies were presented and discussed. Common themes included using automatic indexing and/or crowd-sourcing to populate or improve taxonomies. Many speakers also stressed the importance of semantic technologies (SKOS, RDF, and Linked Open Data (LOD)), which allow mappings to other vocabularies and LOD. For example, Roger Press of Academic Rights Press Ltd described musicweb, a music portal that exploits LOD to ‘discover’ information about artists that is not otherwise available.

Andreas Blumauer of PoolParty described the advantages of graph databases over traditional databases, including their ability to map to data and documents stored in other systems and databases, and to support more complex queries.

In short, the need for taxonomies, and for people who know how to construct them, looks stronger than ever.

Posted in Conferences/workshops | Leave a comment