Semantic Web in Libraries Conference Report
By Bree Midavaine, Taxonomist
November 25-26, 2019
One of the many tasks the Art Information Commons (AIC) must perform as part of the Mellon grant is to submit recommendations for how to present our art information resources and data as linked open data. As noted in our environmental scan, one of the first steps has been to reach out to other institutions that are tackling similar projects and initiatives in order to see how they are implementing linked open data, hear what challenges and successes they have encountered, and understand how their newly accessible data is being used. For the past couple of months, we have used our environmental scan to direct us toward whom to contact. The AIC team has had many conversations with institutions such as the Milwaukee Art Museum, the Art Institute of Chicago, SFMOMA, Boston University, Princeton, and many more.
In addition to these conversations, conferences on the semantic web and linked open data can also be a vital resource for this type of investigation. As a taxonomist who is interested in semantic technologies, I thought the Semantic Web in Libraries (SWIB) conference seemed perfect for this type of information gathering. I was hoping to learn how these institutions were implementing linked open data and whether and how they used knowledge graphs. It was also a great opportunity to be introduced to different types of linked open data projects from a variety of institutions, most of them outside of the United States, all in the course of only a couple of days. With that in mind, I will share the background of the conference and summaries of the presentations, and note which ones were my favorites. It should be noted that the presentation summaries and the designation of a “favorite” are based on the needs of our institution and initiative (plus a little of my own personal preference thrown in).
The Semantic Web in Libraries conference, according to the SWIB website, is organized by the Leibniz Information Centre for Economics (ZBW) and the North Rhine-Westphalian Library Service Centre (hbz) and is attended by all those interested in linked open data (LOD) tools and research projects, including researchers, developers, librarians (or taxonomists), and IT staff. This annual conference has been held for the past eleven years, alternating between the German cities of Hamburg, Cologne, and Bonn. The website states, “The topics of talks and workshops…revolve around opening data, linking data, creating tools and software for LOD production scenarios. These areas of focus are supplemented by presentations of research projects in applied sciences, industry applications and LOD activities in other areas.” Participants this year were from the National Library of the Netherlands, National Library of Finland, National Library of Spain, Frankfurt University Library, Trinity College, National Library of France (BnF), French Institute for Research in Computer Science and Automation, Max Planck Institute for the History of Science, @Cult/Casalini Libri, the Göttingen Society for Scientific Data Processing (GWDG), Dublin Core Metadata Initiative (DCMI), Online Computer Library Center (OCLC), Kent State University, hbz/graphthinking, and the National Library of Sweden.
Pre-conference Cocoda Workshop
Cocoda is a tool developed by VZG in Germany for managing knowledge organization systems and mapping between them. It is still in the development stage and is currently geared towards German audiences. The workshop was presented by Jakob Voss from GBV and focused on the programming required for Cocoda to work. Unfortunately, there were some snafus that prevented us from working in the tool. We did get an overview of the work that went into creating Cocoda, but I did not get to see its capabilities. There is currently a tutorial on the Cocoda website. The other workshops covered OpenRefine, subject indexing with Annif, IIIF, and an introduction to Jupyter Notebooks. Slides for these workshops can be found on the program page of the conference website.
Main Conference Takeaways
The two main themes I saw repeated throughout the conference were RDF, valued for its inferencing capability, and the wider use of Wikibase/MediaWiki and especially Wikidata as gateways to broader LOD on the semantic web. The point about RDF was nicely put on the last day of the conference by Niklas Lindström, when he noted that Google, DCAT-AP, and even BIBFRAME do not consume OWL (Schema.org), so if we want our data to be inferenced by these applications it needs to be in RDF. Wikidata was highlighted by several institutions, including OCLC, for reconciliation and LOD applications.
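To make the idea of inferencing a little more concrete, here is a minimal sketch of one kind of inference over RDF-style triples: if a resource has a type, and that type is a subclass of another class, the resource also has the broader type. This is an illustration only and was not shown at the conference. The triples are plain Python tuples rather than a real RDF library, and every identifier starting with ex: is made up.

```python
# Minimal sketch of RDFS-style subclass inference over (subject, predicate, object)
# triples. All "ex:" identifiers are hypothetical and used only for illustration.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the input triples plus every rdf:type triple implied by rdfs:subClassOf."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for child, parent in subclass_pairs:
                new_triple = (s, RDF_TYPE, parent)
                if o == child and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


data = {
    ("ex:Painting", SUBCLASS_OF, "ex:Artwork"),
    ("ex:Artwork", SUBCLASS_OF, "ex:CulturalObject"),
    ("ex:object1", RDF_TYPE, "ex:Painting"),
}
result = infer_types(data)

# ex:object1 was only declared a painting, but is now also known to be
# an artwork and a cultural object.
for triple in sorted(result):
    print(triple)
```

The only statement made about ex:object1 is that it is a painting; the two broader types are derived from the class hierarchy rather than entered by hand.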
Day 1: Summaries and Favorites
Please note some of the wording for these summaries has been taken from the abstracts and slides (program page) provided to SWIB, in order to accurately portray the goals of each presentation.
Publishing Linked Data on Data.Bibliotheken.nl by René Voorburg from the KB, the National Library of the Netherlands, focused on data conversion from Pica into SKOS and then into RDF, which is A LOT of work. They also mentioned the need to focus on what we within the AIC team call “data self-care,” which is the idea that there is a need to address and improve the current state of data within an institution before embarking on a large project.
Favorite: 20 million URIs and the overhaul of the Finnish Library sector subject indexing by Matias Frosterus and Jarmo Saarikko from The National Library of Finland was a very interesting presentation on their need to create multilingual interlinked vocabularies. Finland has two main languages, Finnish and Swedish-Finnish. Prior to their project, the two subject vocabularies weren’t linked together, so search results for Finnish-speaking users differed from those for Swedish-Finnish-speaking users. The National Library of Finland mapped both sets of subject vocabulary to English and then linked to other Finnish-specific ontologies, LCSH, and Wikidata. In addition, they decoupled the subject fields in MARC, placing each section of the subject string in new MARC fields so that users can search by section rather than by the whole string. Being a cataloger at heart, I found all of this fascinating! They also had an interesting communication strategy which could be useful for many large-scale projects. It is interesting to note that this was the first of many presentations in which institutions relied on Wikidata to help map vocabularies, which then provided a gateway into linked open data.
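The benefit of interlinked multilingual vocabularies can be sketched in a few lines. In this hypothetical example, each concept has one identifier with labels in Finnish, Swedish, and English, so a search in either language resolves to the same concept. The identifiers and the tiny vocabulary are invented for illustration and are not taken from the National Library of Finland's data.

```python
# Hypothetical vocabulary: one made-up identifier per concept, with a label
# in Finnish (fi), Swedish (sv), and English (en).
CONCEPTS = {
    "ex:c1": {"fi": "kissat", "sv": "katter", "en": "cats"},
    "ex:c2": {"fi": "koirat", "sv": "hundar", "en": "dogs"},
}


def build_label_index(concepts):
    """Map every lowercase label, in any language, to its concept identifier."""
    index = {}
    for concept_id, labels in concepts.items():
        for label in labels.values():
            index[label.lower()] = concept_id
    return index


def find_concept(term, index):
    """Return the concept identifier for a search term, or None if it is unknown."""
    return index.get(term.strip().lower())


label_index = build_label_index(CONCEPTS)

# A Finnish and a Swedish search term lead to the same concept.
print(find_concept("kissat", label_index))
print(find_concept("Katter", label_index))
```

Because records are indexed by concept identifier rather than by label, users searching in either language would retrieve the same set of records.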
In and out: workflows between library data and linked-data at the National Library of Spain by Ricardo Santos drew the analogy that in the library world MARC is like creating floor plans in a 2D world, while moving to FRBR and Linked Open Data opens up the design of spaces in a 3D world. The National Library of Spain, like many others, has been linking and reconciling entities by using a combination of VIAF, DBpedia, Library of Congress authorities, and Wikidata. While Wikidata did need some oversight and quality control, it was, for the most part, good data. This has proven to be a good way to build catalogs that serve both public use and enhanced discovery through search engines.
From raw data to rich(er) data: lessons learned while aggregating metadata by Julia Beck from Frankfurt University Library was an analysis of their data workflows in the performing arts field, which go through four steps: (1) analysis and documentation of the data to be ingested, (2) preprocessing of the raw data for more interoperability, (3) modeling and transforming of this data with title and authority data, and finally, (4) interlinking and enrichment of the entities. They did note that the current workflow has one drawback: it does not allow them to give the transformed data back to the institutions that provided it, because after transformation the data is no longer compatible with the source systems.
Favorite: NAISC: an authoritative Linked Data interlinking approach for the library domain by Lucy McKenna from the ADAPT Centre at Trinity College presented her graphical user interface (GUI), which is built to allow users to interact with their data, enhance the metadata, and create linked open data upon sharing. While this tool is not currently available as open source, she is hopeful that it will be made available once she finishes her doctorate. According to the field tests she conducted, users found this metadata enhancement tool easy to navigate once they were familiar with it. Those tests could be replicated in other institutions.
Cool and the BnF gang: some thoughts at the Bibliothèque nationale de France about handling persistent identifiers by Raphaëlle Lapôtre from the Bibliothèque nationale de France made some interesting points about the issues surrounding the creation of persistent URIs and presented best practices for URI creation. The key takeaway is that we should establish uniform URI namespaces and best practices from the very beginning of LOD projects.
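As a rough illustration of what a uniform URI namespace could look like in practice, the sketch below mints every URI through a single function that enforces one pattern: a fixed base, a known entity type, and an opaque local identifier. This is not the BnF's scheme. The base URI, the entity types, and the identifier rule are all assumptions chosen for the example.

```python
import re

# All values below are hypothetical choices for illustration only.
BASE_URI = "https://data.example.org/"
ENTITY_TYPES = {"person", "work", "concept"}
LOCAL_ID_PATTERN = re.compile(r"[a-z0-9]+")


def mint_uri(entity_type, local_id):
    """Build a URI of the form <base>/<entity type>/<opaque local id>.

    Raises ValueError when the input does not follow the agreed pattern,
    so that non-conforming URIs are never published.
    """
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    if not LOCAL_ID_PATTERN.fullmatch(local_id):
        raise ValueError(f"local id must be lowercase letters and digits: {local_id!r}")
    return f"{BASE_URI}{entity_type}/{local_id}"


print(mint_uri("person", "p000123"))
```

Keeping the local identifier opaque, rather than building it from a name or title, means the URI does not have to change when a label is corrected.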
Day 2: Summaries and Favorites
Favorite: Smart Data for Digital Humanities by Marcia Zeng at Kent State University. I was looking forward to this talk because I am familiar with her work from research I did while I was in graduate school. Her main point was that our data must be cleaned, transformed, and refined in order to generate real value. Knowledge graphs are the best thing we can use to transform our data from just big data into smart data. The primary challenge is getting all of the data we need: the most useful data is not the stuff that can be scraped from the web, but the data that resides in our institutions. The best way to get at this data is by annotation, data mining, indexing, linking, and using simple AI technologies like natural language processing and text analysis. The goal of smart data is to reveal what she called “unknown unknowns,” so that discovery becomes about the non-obvious. She also discussed the Sampo model, a semantic model built around a shared ontology infrastructure that uses mutually aligned metadata and shared domain ontologies to create smarter data.
Favorite: Digital sources and research data: linked and usable by Florian Kräutli and Esther Chen from the Max Planck Institute for the History of Science presented on their Digital Research Infrastructure for the Humanities. This model uses a repository for their digital artefacts, such as sources, annotations, and databases, coupled with a knowledge graph. These are accessed through working environments which researchers use to enhance the metadata. Their goal was to make research outputs usable and accessible for the long term by adopting a common model to represent their digital knowledge and then implementing linked open data technologies. The architecture uses CIDOC-CRM, ResearchSpace, Metaphactory, and custom working environments. Their slides showed off their architecture build and could be helpful to others looking to build the same thing.
Data modeling in and beyond BIBFRAME by Tiziana Possemato from @Cult and Casalini Libri showed another project looking to enhance and reconcile library metadata while using BIBFRAME.
Empirical evaluation of library catalogues by Péter Király at GWDG concluded that for better searchability MARC subjects should be broken down or split apart.
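A small sketch shows why splitting subject strings helps searching. It assumes, purely for illustration, that each subject heading is available as a single display string with "--" between its parts. Real MARC records hold these parts in separate subfields, which this sketch does not parse, and the record identifiers and headings are invented.

```python
def split_subject(heading, separator="--"):
    """Split a subject display string into its individual, trimmed components."""
    parts = (part.strip() for part in heading.split(separator))
    return [part for part in parts if part]


def build_search_index(records):
    """Map each lowercase subject component to the set of record IDs that use it."""
    index = {}
    for record_id, heading in records.items():
        for part in split_subject(heading):
            index.setdefault(part.lower(), set()).add(record_id)
    return index


# Hypothetical records with whole-string subject headings.
records = {
    "rec1": "Painting -- Finland -- 20th century",
    "rec2": "Sculpture -- Finland -- Exhibitions",
}
subject_index = build_search_index(records)

# Searching on one component finds both records, which a whole-string
# match on either heading would not.
print(sorted(subject_index["finland"]))
```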
Design for simple application profiles by Karen Coyle and Tom Baker from Dublin Core Metadata Initiative presented a simpler simple application profile for users who are unfamiliar with profiles and are new to data in general.
SkoHub: KOS-based content syndication with ActivityPub by Adrian Pohl and Felix Ostrowski from hbz and graphthinking was a presentation which I need to review further, as it was the most difficult for me to apply to the Art Information Commons initiative. It is a project that is looking to create a tool which will automate notifications on specific subjects set by the institutions which use it. This would allow the tool to send them notifications of relevant content from the repositories specific to the subject profiles they set up.
Proposing rich views of linked open data sets: the S-paths prototype and the visualization of FRBRized data in data.bnf.fr by Raphaëlle Lapôtre, Marie Destandau, and Emmanuel Pietriga from the Bibliothèque nationale de France presented on data visualization, which can provide meaningful and more readable insights into a dataset.
Target Vocabulary Maps by Niklas Lindström from the National Library of Sweden focused on the need to map to the vocabularies that best suit your institution. In moving from MARC21 to BIBFRAME and Linked Open Data, the National Library of Sweden created a predefined set of properties and classes to which all other vocabularies should be mapped.
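The general idea of mapping everything to one predefined target set can be sketched as a simple lookup. This is not the National Library of Sweden's implementation. The property names with the prefixes srcA:, srcB:, and target: are hypothetical, and the sketch only renames properties without handling classes or structural differences between vocabularies.

```python
# Hypothetical mapping from source-vocabulary properties to one target vocabulary.
TARGET_MAP = {
    "srcA:title": "target:title",
    "srcB:name": "target:title",
    "srcA:creator": "target:contributor",
    "srcB:author": "target:contributor",
}


def to_target(record, mapping):
    """Rename a record's properties to the target vocabulary.

    Returns two dicts: the mapped values (as lists, since several source
    properties may map to one target property) and the values whose
    property has no mapping yet, kept for later review.
    """
    mapped, unmapped = {}, {}
    for prop, value in record.items():
        target = mapping.get(prop)
        if target is None:
            unmapped[prop] = value
        else:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


record = {
    "srcB:name": "Untitled (Blue)",
    "srcB:author": "Example Artist",
    "srcB:shelfmark": "X-12",
}
mapped, unmapped = to_target(record, TARGET_MAP)
print(mapped)
print(unmapped)
```

Returning the unmapped properties separately makes gaps in the mapping visible instead of silently dropping data.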
Favorite: Lessons from representing library metadata in OCLC Research’s Linked Data Wikibase prototype by Karen Smith-Yoshimura from OCLC was a fascinating presentation on OCLC’s use of Wikibase for their Project Passage. It was an excellent walk-through of their process of creating the prototype and what they hoped to achieve for their various use cases. They too took advantage of knowledge graphs working with metadata enhancement tools, and they stressed the need for interoperability between data sources to combat the issues Julia Beck discussed in her presentation the day before.
If you are interested in learning more, the program page includes an abstract, slides, and video for each presentation. SWIB has a YouTube channel with all of the videos from the conference, as well as from other events and previous conferences. Please contact Bree Midavaine at firstname.lastname@example.org if you have any questions. If you are requesting information about a SWIB presentation, please include the title of the presentation in your email. For more information about our environmental scan, please see our white paper, “Setting the Scene.” The scan can be found on the Resources page of the Art Information Commons website.
The Art Information Commons at the Philadelphia Museum of Art has been made possible by the Mellon Foundation.