Backup

From geokb
Revision as of 18:46, 16 November 2023 by Sky (Initial discussion on backup strategy)

From USGS, we are excited to use the wikibase.cloud platform as a rallying point to organize a knowledge representation across our scientific domains. We are working to align the GeoKB with both formal ontologies and their intersections with Wikidata properties, classification, and instances. Part of the goal is to make our scientific knowledge more translatable into other domains of use and interest.

At the same time, we need to preserve as much future capability from this resource as possible. As a U.S. Government organization, we follow 2013 guidance from the Office of Management and Budget that encouraged U.S. agencies to leverage third-party, cloud-based resources where they advance the mission, while also ensuring that the same content remains available through an alternative route on official ".gov" resources.

All of the material we are organizing into the GeoKB knowledge representation comes from existing public sources, and we are working through conventions and methods in our Wikibase implementation to clearly show those linkages, including through processing codes when applicable. At the same time, the specific representation of this material in the GeoKB Wikibase instance is a unique derivative of the original sources, often instantiating explicit relationships via ontology mapping that might otherwise be only implied in the original. We also take advantage of the platform's capabilities for multi-modal creation and editing of properties (P identifiers) and entities (Q identifiers), including code-based inputs (recorded as bot contributions) and user-based inputs (tied to registered user accounts). In addition, we use property and entity discussion pages, along with other wiki pages, to store content not otherwise organized into the knowledge graph itself and to record some of our reasoning and documentation on the knowledgebase.

Because we use this platform as a living knowledgebase, we need to back up and store the content on a U.S. Government platform, both to future-proof the functionality and to ensure that we can, as necessary, provide an alternative environment on a .gov host. Fortunately, the MediaWiki team and contributors have developed MediaWiki Dump Generator, a Python toolset for generating a full and complete backup of a MediaWiki instance, including a full history of changes. A dump is a reasonably lightweight XML dataset that can be loaded into a clean Wikibase instance or read and transformed into some other structure as needed.
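As a minimal sketch of working with such a dump, the following reads page titles and revision counts from MediaWiki export XML using only the Python standard library. The inline sample document, the `Item:Q42` title, and the export namespace version are illustrative assumptions, not actual GeoKB content; a real dump would be read from a file instead of a string.

```python
# Sketch: summarize pages and revision history in a MediaWiki XML dump.
# The sample XML below is a made-up stand-in for a real dump file.
import xml.etree.ElementTree as ET
from io import StringIO

SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Item:Q42</title>
    <revision><id>1</id><text>first draft</text></revision>
    <revision><id>2</id><text>second draft</text></revision>
  </page>
</mediawiki>"""

def summarize_dump(xml_source):
    """Yield (title, revision_count) for each <page> element in a dump."""
    root = ET.parse(xml_source).getroot()
    # The '{*}' wildcard (Python 3.8+) matches any export namespace version,
    # since the namespace URI varies across MediaWiki releases.
    for page in root.findall("{*}page"):
        title = page.findtext("{*}title")
        revisions = page.findall("{*}revision")
        yield title, len(revisions)

for title, n in summarize_dump(StringIO(SAMPLE_DUMP)):
    print(f"{title}: {n} revision(s)")
```

Because the full revision history travels with the dump, the same parsing approach can feed either a restore into a fresh Wikibase or a transformation into another structure.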

In the near term, a background process run as an AWS Lambda that keeps a dump of the full GeoKB stored in the USGS cloud suffices to give us a safeguard and a starting point for an alternative representation, should we need to stand something up. We may also look at a more curated process that syncs only the current state of the content into an alternative graph data store, if we surface a need for that capability to serve other purposes. This would use either the SPARQL service or the Wikimedia API to sync the GeoKB with one or more alternate graph databases.
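The Lambda step above can be sketched roughly as follows. The bucket name, key prefix, and handler shape are assumptions for illustration and not the actual USGS configuration; the upload call in the docstring shows where a real handler would use boto3 after running MediaWiki Dump Generator.

```python
# Hedged sketch of a scheduled backup Lambda. Names here (bucket, prefix)
# are hypothetical placeholders, not the real deployment values.
from datetime import datetime, timezone

def dump_object_key(prefix="geokb-backups", now=None):
    """Build a date-stamped object key for storing an XML dump."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}/geokb-dump-{now:%Y-%m-%d}.xml.gz"

def handler(event, context):
    """Illustrative Lambda entry point.

    A real handler would first run MediaWiki Dump Generator against the
    wikibase.cloud API to produce the dump file, then upload it, e.g.:
        boto3.client("s3").upload_file(local_path, BUCKET, dump_object_key())
    """
    return {"key": dump_object_key()}
```

Date-stamping each key keeps successive dumps side by side, so older snapshots remain available as restore points rather than being overwritten in place.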