Archivismi: archiving Cassandra, part two

The articles of Cassandra Crossing are licensed under CC BY-SA 4.0 | Cassandra Crossing is a column, started in 2005, created by Marco Calamari under the "nom de plume" of Cassandra.

Second part of archiving Cassandra

This article was written on January 1, 2024 by Cassandra

Cassandra Crossing 565/ Archivismi: archiving Cassandra, part two

Now that the PDFs are ready there are no more excuses: we have to archive our first Cassandra Crossing article.

In previous episodes of Archivismi we explained the main features of the Internet Archive and uploaded a simple example document. We then set ourselves the ambitious goal of uploading the complete works of Cassandra, and laboriously prepared the necessary material in the most appropriate formats and structure.

There are no more excuses; it's time to start uploading the first Cassandra Crossing document, with all the little things and metadata in the right place!

So we really have to get to grips with ia and, since we will have to upload hundreds of documents, we should not do it directly from the command line, uploading one file at a time and writing all the parameters and metadata on one very long command line.

Much better to start practising right away with bulk uploads, which are performed by passing ia a single parameter, namely the name of a spreadsheet in CSV format, into which we will enter the necessary data (and modify it many times to remedy the inevitable errors).

The command to do this is simply

ia upload --spreadsheet=metadata.csv

The real work will be filling the final spreadsheet with thousands of lines of data, but let's take it one step at a time and load just one object, so a three-line file will suffice.

Our first document will contain two of the files generated for archiving: the PDF as the main document and the HTML file with its content embedded as the second file. We will also add a minimum of metadata, and the identifier will be the same as the file name without the extension.

In short, after many, many attempts, here is the sheet...
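A three-line sheet of the kind described above could be produced with a short Python sketch like the following. It is only an illustration: the identifier, the file names and the title are hypothetical placeholders, and the choice of columns (identifier, file, mediatype, collection, title) is an assumption about what a minimal sheet might contain, not a copy of the sheet actually used. The metadata cells on the second row are left empty on the assumption that the values set by the first row are enough for the whole object.

```python
import csv
import io

# Hypothetical placeholders: the identifier, file names and title below are
# illustrative and do not come from the article.
IDENTIFIER = "example-cassandra-article"
FIELDS = ["identifier", "file", "mediatype", "collection", "title"]

ROWS = [
    # First row: the PDF as main document, plus the object-level metadata.
    {
        "identifier": IDENTIFIER,
        "file": IDENTIFIER + ".pdf",
        "mediatype": "texts",
        "collection": "test_collection",
        "title": "Example title",
    },
    # Second row: the HTML file for the same identifier; metadata cells are
    # left empty on the assumption that the first row already set them.
    {"identifier": IDENTIFIER, "file": IDENTIFIER + ".html"},
]


def build_sheet(rows):
    """Return the CSV text: one header line plus one line per file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        delimiter=",",        # force a real comma, whatever the locale says
        lineterminator="\n",
        restval="",           # missing cells are written as empty strings
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    with open("metadata.csv", "w", newline="", encoding="utf-8") as handle:
        handle.write(build_sheet(ROWS))
```

Generating the sheet from a script rather than from a spreadsheet application keeps the separator and the characters in the file names under direct control.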

It seems easy, but it took half a day of work to get the first satisfactory upload. Seemingly insignificant but actually diabolical details required a lot of time spent on tests and counter-tests. I'll tell you about some of them here, hoping to save you precious time.

One: when you save a spreadsheet in CSV format, which stands for "comma-separated values", don't trust your application. In certain cases, here in Italy, the application may decide to use the semicolon instead of the comma, and you won't notice it immediately. I swear, it happened!

Two: disable all auto-correction tools in the application you use to manage the spreadsheet; otherwise the program will certainly decide to replace something for your own good. In my case it decided to replace two consecutive minus signs, present in the file names, with a "long dash", a practically invisible change, even from the command line. This led to an inexplicable "file not found" error message, and required a few dozen tests, with the associated clutching at straws. I will not report here the words that were spoken when the problem was finally located!

Three: be very careful when entering values into fields. A single white space before or after the value may not be ignored, and may have unexpected effects. A space at the beginning of " test_collection", for example, prevented the object from being correctly assigned to the test collection, which is intended, as you already know, to enable automatic deletion after 30 days. Also bear in mind that it is not possible to explicitly assign the object to public collections such as "opendata"; you must accept the automatic selection made by the system.

Four: when the documents are textual (txt, html, pdf, etc.), add the mediatype column to the sheet and use the value "texts"; otherwise the system will automatically assign the value "data", and this will have insidious side effects. For example, the object browser won't let you browse the pages, even though all the necessary derivative files have been created correctly. The mediatype, unlike the vast majority of parameters, cannot be modified afterwards: you have to delete and regenerate the object.

Five: deleting an object is not an instantaneous operation; it takes minutes or tens of minutes before the effect spreads to all parts of the site interface. It's not worth deleting from the command line with ia; it is definitely more practical to do it from the My Uploads page. Reload the page often and, if you notice strange things, also try clearing the browser cache.

Six: the appearance of a newly created object in the My Uploads window is, strangely enough, quite fast, but it triggers all the "derive" operations, which in turn generate the other files in variable but rather long times. This means, for example, that the object browser will not let you browse the pages until about half an hour has passed, and that the search function inside the object browser will only become active after several hours.
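The first four pitfalls can be caught before running ia at all. What follows is a minimal pre-flight sketch in Python, not part of ia itself: the function name check_sheet and the list of suspect characters are assumptions made for illustration. It looks for a semicolon-separated header, a missing mediatype column, typographic dashes or quotes introduced by auto-correction, and stray whitespace around values.

```python
import csv
import io

# Characters that auto-correction commonly substitutes for plain ASCII ones.
SUSPECT_CHARS = {
    "\u2013": "en dash",
    "\u2014": "em dash",
    "\u2018": "left single quote",
    "\u2019": "right single quote",
    "\u201c": "left double quote",
    "\u201d": "right double quote",
}


def check_sheet(text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    lines = text.splitlines()
    header = lines[0] if lines else ""

    # Pitfall one: the application saved with semicolons instead of commas.
    if header.count(";") > header.count(","):
        problems.append("header looks semicolon-separated, not comma-separated")
        return problems  # the remaining checks assume comma-separated data

    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []

    # Pitfall four: without a mediatype column the value "texts" is never set.
    if "mediatype" not in columns:
        problems.append("no mediatype column")

    # Row 1 is the header, so data rows are numbered from 2.
    for number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column is None or value is None:
                continue  # ragged row: more or fewer cells than the header
            # Pitfall two: typographic dashes or quotes replacing ASCII ones.
            for char, name in SUSPECT_CHARS.items():
                if char in value:
                    problems.append(f"row {number}, {column}: contains {name}")
            # Pitfall three: stray whitespace around a value.
            if value != value.strip():
                problems.append(
                    f"row {number}, {column}: leading or trailing whitespace"
                )
    return problems
```

Reading metadata.csv as UTF-8 text and passing it to check_sheet returns an empty list when none of these problems is found; anything else is worth fixing before the upload, since the mediatype in particular cannot be corrected afterwards.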

But, in the end, what a satisfaction...