John McCrae

Abstract

With the recent adoption of deep learning architectures by artificial intelligence practitioners in areas such as natural language processing, it has become increasingly important that large amounts of data are available to train these AI systems. At the same time, however, digital obsolescence is occurring at an unprecedented scale: datasets are disappearing or becoming unavailable or unusable almost as quickly as new ones are created. The Prêt-à-LLOD project, consisting of 10 European partners from industry and academia, has been tackling this challenge in the context of linguistic data through the development of new technologies built on the linked open data stack. We have developed the LingHub platform to ensure that data remains documented, and a distributed, content-addressing technology called interplanetary LOD (iLOD) to ensure that data remains accessible. Further, we have been developing novel linking and integration technologies to ensure that data remains documented and usable, recognizing that digital obsolescence is due not only to the unavailability of resources but also to the inability of tools to process legacy data formats. We envision that these technologies can help develop open, sustainable science, not only for linguistics and artificial intelligence but also for other disciplines.

Sustainable Interconnected Data for Artificial Intelligence - John McCrae.pdf
