Archivismi: we archive Cassandra, part three

Warning: this post was created 6 months ago.

This is a text automatically translated from Italian. If you appreciate our work and if you like reading it in your language, consider a donation to allow us to continue doing it and improving it.

The articles of Cassandra Crossing are licensed under CC BY-SA 4.0 | Cassandra Crossing is a column created by Marco Calamari under the "nom de plume" of Cassandra, born in 2005.

And here we are at the third and final part of archiving Cassandra (but there is still one article to go before Archivismi is concluded!).

This article was written on January 5, 2024 by Cassandra.

Cassandra Crossing 566/ Archivismi: we archive Cassandra, part three

It's time to conclude; mass uploading of Cassandra Crossing starts!

In previous episodes of Archivismi we explained, in broad terms, how "true" archiving on the Internet Archive works. "True" because it is not a question of uploading a directory of files, but of creating real archival objects, complete with all the files and metadata needed to define the object and make it useful and usable. And metadata, believe it or not, is by far the hardest and most useful part.

So, first of all, to archive our favorite column we had to ask ourselves what to archive in addition to the classic PDF. The choice was to add to each object an HTML file and a file in Markdown format, the latter useful for any further processing that might be needed. Some articles also discussed books or free publications, and in these few cases the PDF of the publication was also included in the object.

Well, having said that, these blessed 1686 files had to be created. The Markdown, HTML and PDF files were generated completely automatically from the HTML files of the articles exported from Medium.com, using the ready-to-use tools prepared in the previous episodes. All simple, then?

Obviously not. In these travel notes, your favorite prophetess will tell you about the further vicissitudes she encountered on her journey.

One: the data from Medium.com still contained errors. The most common and most painful kind was a wrongly built file name, which was created by automatically detecting the article number. There were two main causes. The first is that some articles were simply numbered incorrectly. The second is that the files contained the article number not only in the text but also in the header automatically created by Medium.com, a header that was never updated after its creation; guess where the article number was taken from?
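A discrepancy of this kind could be caught before the file names are built by comparing the number found in the header with the one found in the text. The following is only a minimal sketch: the title pattern is modelled on "Cassandra Crossing 566/ ..." and the function names are hypothetical, not the tools used for the actual archive.

```python
import re

# Assumed title pattern, modelled on "Cassandra Crossing 566/ Archivismi: ...".
NUMBER_RE = re.compile(r"Cassandra Crossing\s*(\d+)\s*/")


def article_number(title):
    """Return the article number found in a title, or None if absent."""
    match = NUMBER_RE.search(title)
    return int(match.group(1)) if match else None


def numbering_problem(header_title, body_title):
    """Describe a numbering discrepancy, or return None when both agree."""
    header_no = article_number(header_title)
    body_no = article_number(body_title)
    if header_no is None or body_no is None:
        return "number missing"
    if header_no != body_no:
        return f"header says {header_no}, text says {body_no}"
    return None
```

Any article for which `numbering_problem` returns a message would be checked by hand before its file name is generated.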

Two: once the files had been properly created and renamed, creating the spreadsheet was simple. Keeping each upload run in a new sheet was very useful for locating errors and retracing our steps. Keeping the execution log of ia was also very useful for extracting errors.

Three: in some cases, fixing the numbering of the articles broke the correspondence between the file name and the object identifier. While files and metadata can be modified, added and deleted, the object identifier cannot be changed once it has been created. And when the file generation procedure is run again, if the numbering changes, some file names change too. To generate the subsequent upload sheets it was necessary to take this into account and carry out exhaustive checks of the alignment between identifiers and file names. Of course, the temptation to correct everything and start the procedures all over again was strong. But total automation is not the end, only a means. Saving time while still doing things right is the real goal.
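An alignment check of this sort can be sketched in a few lines. The sketch assumes, purely for illustration, that both identifiers and file names begin with the article number followed by an underscore, as in the file name quoted in the error message of step Five. A flagged row is not necessarily an error, since a renumbered article legitimately keeps its old identifier; it is only a row to review by hand.

```python
import re

# Assumption: names start with the article number and an underscore ("186_...").
LEADING_NUMBER = re.compile(r"^(\d+)_")


def leading_number(name):
    """Return the leading article number of a name, or None if absent."""
    match = LEADING_NUMBER.match(name)
    return int(match.group(1)) if match else None


def misaligned_rows(rows):
    """Return the (identifier, file) pairs whose leading numbers differ."""
    problems = []
    for identifier, filename in rows:
        id_no = leading_number(identifier)
        file_no = leading_number(filename)
        if id_no is None or file_no is None or id_no != file_no:
            problems.append((identifier, filename))
    return problems
```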

Four: the first bulk upload, of the PDF file only, was done for 10 objects. We then waited for the various automatic alchemies of the Internet Archive to complete, and examined the result carefully. At the metadata level this led us to change some choices to make the metadata more useful.

Five: the remaining 552 PDFs were then bulk uploaded, thus creating all the objects. The objects, and in particular the identifiers, never changed in any of the subsequent operations. During this first real bulk upload, some objects failed to be created and error messages were generated, because the operation in progress had been identified as spam, like this one:

uploading error 186_Cassandra-Crossing — L-Internet-senza-Rete.pdf: Please reduce your request rate. — Your upload of 186_Cassandra-Crossing — L-Internet-senza-Rete from username pippo@pluto.paperino appears to be spam. If you believe this is a mistake, contact info@archive.org and include this entire message in your email.

No sooner said than done: I contacted the help desk by email and, perhaps because I am a long-time user as well as a regular donor, within a few hours it removed some obvious anti-spam limitations. Subsequent uploads no longer caused any problems.
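The error message quoted in step Five also suggests how the failed files could be pulled out of a saved execution log, so that only those are uploaded again. This is a sketch under a loud assumption: the line format is inferred from that single quoted message, and the real wording of the log may differ.

```python
import re

# Assumed line format, inferred from the one message quoted in the article.
ERROR_RE = re.compile(
    r"^\s*(?:uploading error|error uploading)\s+"
    r"(?P<file>.+?\.(?:pdf|md|html)):\s*(?P<reason>.*)$"
)


def failed_uploads(log_text):
    """Return (file name, reason) pairs for every error line in the log."""
    failures = []
    for line in log_text.splitlines():
        match = ERROR_RE.match(line)
        if match:
            failures.append((match.group("file"), match.group("reason")))
    return failures
```

The resulting list of file names could then be used to build the sheet for a second run containing only the missing files.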

Six: two additional, separate bulk uploads were performed, one for the Markdown files and one for the HTML files. Only two columns were needed in the spreadsheets: identifier and file. The metadata had already been assigned when the objects were created, that is, during the first bulk upload. If it ever needs to be changed en masse, a "bulk correction" will be necessary.
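A two-column sheet like the one described in step Six is easy to generate as CSV. The sketch below is illustrative only: the function name is hypothetical, and the rows are assumed to come from an already checked list of identifier and file pairs.

```python
import csv
import io


def upload_sheet(rows):
    """Return CSV text with the two columns named in the article."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["identifier", "file"])  # header row: the only two columns
    writer.writerows(rows)
    return buffer.getvalue()
```

Assuming the ia command-line tool accepts such a sheet through its spreadsheet option for uploads (check `ia upload --help` before relying on it), the saved text would be the input for one bulk run.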

Seven: the metadata was edited in bulk, inserting the description (taken from the subtitle) and the publication date. Both of these data columns were generated with a modified version of the procedure already seen: starting from the Markdown files, extracting the field with a regular expression, adding, cleaning and correcting the missing or incorrect fields by hand, and then copying the right ranges into the spreadsheet for the bulk upload. Despite the "standardizations" of the previous phases of editing and manipulating the article files, it took more than half a day to resolve the discrepancies.
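The extraction in step Seven might look roughly like the sketch below. The layout of the Markdown files is not described in the article, so everything here is an assumption: a title on a `# ` line, the subtitle on the first non-heading line after it, and a date written as YYYY-MM-DD somewhere in the text. Fields that are not found come back as None, so that they can be filled in by hand.

```python
import re

# Assumed layout: "# Title", then the subtitle on the next non-empty line.
SUBTITLE_RE = re.compile(r"^#[ \t]+.+\n+(?!#)(?P<subtitle>\S.*)$", re.MULTILINE)
# Assumed date notation: ISO style, e.g. 2024-01-05.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_metadata(markdown_text):
    """Return the description and date fields, None where not found."""
    subtitle = SUBTITLE_RE.search(markdown_text)
    date = DATE_RE.search(markdown_text)
    return {
        "description": subtitle.group("subtitle").strip() if subtitle else None,
        "date": date.group(0) if date else None,
    }
```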

Eight: it took a few more hours to go through the list of articles sorted by date on the Internet Archive site and check that everything that should be there was there. Here too some small errors emerged, but only in the dates. In just one case both the titles and the dates were swapped, but fortunately these are metadata too, and therefore easily corrected. It was also a satisfaction to retrace twenty years of work in just a few hours!

And that's all for today too, because the revision work is really tiring. We reserve the conclusions and comments for the next and final episode of this first campaign of "Archivismi".

Marco Calamari

Write to Cassandra — Twitter — Mastodon
Video column “A chat with Cassandra”
Cassandra's Slog (Static Blog).
Cassandra's archive: school, training and thought
