The second edition of the Taxonomy Boot Camp took place in London on 17 & 18 October (http://www.taxonomybootcamp.com/London/2017/default.aspx). More than 200 participants attended the conference; 21 presentations were offered, most of them as part of parallel sessions. Below are some notes on the sessions I attended.
The opening keynote “Kick the beehive: new approaches to building taxonomies for the real world” by Madi Weland Solomon guided us into the real world of taxonomies and ontologies, text mining, auto-classification, graph databases and advanced semantic technologies. Questions about these important issues and developments were introduced by different musical notes and led to presents and gifts: a dynamic and entertaining session indeed.
Dave Clarke, CEO of Synaptica, presented a best-practice guide on how to jump-start a taxonomy project. Indispensable key concepts are the formal data models ISO 25964-1 (2011) and SKOS. Governance and management (create, edit, publish), together with the adopt-adapt-create approach, are important cornerstones, as are principles such as semantic integrity, interoperability and collaboration.
The session “Making sense of unstructured and large datasets” focused on search, navigation and discovery. Ahren Lehnert (Synaptica) built his talk on the notion of the “saviour machine”, looking at whether or not to engage with text analytics and machine learning. Martin Kaltenböck (Semantic Web Company) informed us in detail about the UN Climate Technology Centre and Network (CTCN), more specifically about its knowledge graph implementation. It is useful to know that Semantic Web Company organises the SEMANTiCS conference (https://2017.semantics.cc/).
Interdisciplinarity, disambiguation, concept clarification, vocabulary alignment and the essential communication between taxonomists, librarians and field specialists were the interesting issues discussed by Solveig Sørbø and Heidi Konestabo of the Norwegian Science Library at the University of Oslo. The important message from Roger Press (Academic Rights Press) was to pass data through several algorithms to derive the ranking order.
I took part in the session on “Working with large multi-faceted and multi-lingual taxonomies” with a presentation on “An in-depth view behind the scenes: the grammar, semantics and management of thesauri”. (An email to Jeannine.firstname.lastname@example.org should suffice to receive a copy of the PPT.)
The other slot was given to Beate Früh, Annette Weilandt and Silvia Giacomotti who talked about “Translation of taxonomies: challenges, methods and synergies”, in other words, about the three Ts: Taxonomy, Terminology and Translation. Their work for Suva (Swiss National Accident Insurance Fund) involves four languages: German, French, Italian and English.
A lecture by Tom Reamy (KAPS Group) on the ever so important problem of fake news and how to try and unmask it by using text analytics ended the first day.
The second day was opened by Joseph Busch (Taxonomy Strategies & Semantic Staffing) with a very interesting keynote on “AI vs. automation: the newest technologies for automatic tagging”. Busch addressed the issue of complete and consistent metadata, where a computer performs better than a human indexer (80% versus 70%). Cloud computing, in other words “buying space”, should be encouraged, as should automated tagging. The latter covers entity extraction (NER), keyword extraction, categorizers trained by examples, and summarization (identifying key sentences). The open-source software GATE/ANNIE (services.gate.ac.uk/annie) was recommended for processing human language. AI, on the other hand, is characterized by trained/statistical categorizers such as IBM Watson NLP, Intellexer and Lexalytics. All of this is certainly worth a much closer look.
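To make two of the automated tagging steps more concrete, here is a minimal sketch of keyword extraction and of summarization by picking key sentences. It is not taken from Busch's keynote and has nothing to do with how GATE/ANNIE works internally: it is a plain word-frequency approach using only the Python standard library, and the stop-word list, the function names and the sample text are all my own assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use far larger lists).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
              "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lower-case the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent content words as candidate tags."""
    return [word for word, _ in Counter(tokenize(text)).most_common(top_n)]


def key_sentences(text, top_n=1):
    """Return the top_n sentences whose words are most frequent in the whole text."""
    freq = Counter(tokenize(text))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in tokenize(s)),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical sample text, made up for this illustration.
sample = ("Taxonomies support search. "
          "Taxonomies and thesauri support tagging and search. "
          "Cloud storage is cheap.")

print(extract_keywords(sample))
print(key_sentences(sample))
```

Real taggers add much more (entity recognition, trained categorizers, weighting against a background corpus), but the sketch shows why such steps are easy to run consistently over large volumes of text.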
In the next session, “Collaborative working for website navigation project”, Emma Maxim (Government Digital Service, GDS, UK) talked about revisiting the common UI, which offers an amalgamation of four or five different taxonomies at the same time (search, navigation, filters, types etc.), and replacing it with a single taxonomy that is based on themes only at the first level and presents the next levels as concentric circles. She also guided the audience through a best practice for user research: discovery (information gathering with large groups of users), validation (agile research sprints), beta-testing (set up success criteria before testing!) and dissemination. She recommended using or developing machine learning algorithms to arrive at a testable taxonomy. (More information on GDS developments can be found at https://gds.blog.gov.uk.)
“Semantic models in action” saw a presentation by Julia Barrott and Sukaina Bharwani (Stockholm Environment Institute) on “Visualising a harmonised language for climate services and disaster risk reduction”. Their main focus was on the visualisation of knowledge and knowledge discovery. Each term/concept/variable is identified by a tag/identity card/profile containing definitions, glossaries etc., which is clickable in tree structures.
Sabrina Wilske (Kantar TNS) discussed taxonomies for tweets, addressing the implementation of machine-assisted algorithms and the linguistic problem of finding identifiers in tweets for abstract concepts (e.g. dehydration).
Veronique Malaisé (Elsevier), in “From vocabulary requirements to a SKOS-XL model at Elsevier”, focussed on RDF triples, comparing OWL, SKOS and SKOS-XL. While OWL defines specific properties for predicates (for example transitivity) in terms of “isa” relationships, SKOS expands its predicate options with, for example, meronymy. Using SHACL for automatic checks is a plus, but SKOS offers no possibilities for inference or for adding a predicate to the labels themselves (for example to mark acronyms). In other words, all information is attached to the concepts. SKOS-XL, however, also allows information to be added to the labels and inter-label relationships to be identified.
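The difference between the two label models can be illustrated with a handful of triples. This is a rough sketch of my own, not Elsevier's model: triples are written as plain Python tuples rather than with an RDF library, all `ex:` identifiers are hypothetical, and `ex:isAcronymOf` is an invented property standing in for a specialisation of `skosxl:labelRelation`.

```python
# Plain SKOS: labels are literals attached to the concept, so nothing
# can be said about a label itself.
skos_triples = [
    ("ex:c1", "rdf:type", "skos:Concept"),
    ("ex:c1", "skos:prefLabel", '"magnetic resonance imaging"@en'),
    ("ex:c1", "skos:altLabel", '"MRI"@en'),
]

# SKOS-XL: each label is a resource of its own (skosxl:Label) with a
# literal form, so labels can carry statements and relate to each other.
skosxl_triples = [
    ("ex:c1", "rdf:type", "skos:Concept"),
    ("ex:c1", "skosxl:prefLabel", "ex:l1"),
    ("ex:c1", "skosxl:altLabel", "ex:l2"),
    ("ex:l1", "rdf:type", "skosxl:Label"),
    ("ex:l1", "skosxl:literalForm", '"magnetic resonance imaging"@en'),
    ("ex:l2", "rdf:type", "skosxl:Label"),
    ("ex:l2", "skosxl:literalForm", '"MRI"@en'),
    # Hypothetical inter-label relationship (assumed sub-property of
    # skosxl:labelRelation).
    ("ex:l2", "ex:isAcronymOf", "ex:l1"),
]


def inter_label_relations(triples):
    """Return the triples that link one skosxl:Label resource to another."""
    labels = {s for s, p, o in triples
              if p == "rdf:type" and o == "skosxl:Label"}
    return [(s, p, o) for s, p, o in triples if s in labels and o in labels]


print(inter_label_relations(skos_triples))    # no label resources, so nothing to relate
print(inter_label_relations(skosxl_triples))  # the acronym link between the two labels
```

In the plain SKOS list the acronym "MRI" is just another string on the concept; only in the SKOS-XL list is there a node to which the acronym statement can be attached.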
Cathy Dolbear (OUP) presented on “Automating the categorisation of academic content to a subject taxonomy” at the session “Taxonomy evaluation and maintenance”. Her taxonomy contains 1,500 nodes (SKOS-XML) in 6 layers, is based on/used for manual indexing and allows polyhierarchies. The next step is to classify at chapter and article level, improving the discoverability of the topic/about field. The corpus contains 396 journals, 30,624 books and 7.5 million items at chapter/article level. Concerning retrieval, Dolbear’s advice was to place precision above recall. Tools such as PoolParty, Protégé and scikit-learn are used.
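The advice to favour precision over recall can be made tangible with a small calculation. The sketch below is only an illustration under my own assumptions: the subject names, the classifier scores and the thresholds are invented, and it does not reflect how OUP's pipeline actually works. It shows that raising the confidence threshold for assigning a subject tends to increase precision at the cost of recall.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted vs. truly relevant subjects."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def assign_subjects(scores, threshold):
    """Keep only the subjects whose classifier score reaches the threshold."""
    return {subject for subject, score in scores.items() if score >= threshold}


# Hypothetical classifier scores for one article, and the subjects a
# human indexer would have assigned to it.
scores = {"Physics": 0.9, "Chemistry": 0.6, "History": 0.4}
relevant = {"Physics", "Chemistry"}

strict = precision_recall(assign_subjects(scores, 0.8), relevant)
lenient = precision_recall(assign_subjects(scores, 0.3), relevant)

print("strict threshold  (precision, recall):", strict)
print("lenient threshold (precision, recall):", lenient)
```

With the strict threshold every assigned subject is correct but one relevant subject is missed; with the lenient threshold nothing is missed but a wrong subject slips in. Placing precision above recall means accepting the first kind of error rather than the second.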
The plenary session on “Language is rarely neutral: why the ethics of taxonomies matter”, led by Stella Dextre Clarke, brought some interesting issues to the surface with respect to fake news and gender neutrality/expansion, in other words, the importance of how we classify facts, things and people.