Archivismi: API, when the going gets tough

This is a text automatically translated from Italian.

The articles of Cassandra Crossing are licensed under CC BY-SA 4.0 | Cassandra Crossing is a column created in 2005 by Marco Calamari under the "nom de plume" of Cassandra.

We have reached the sixth episode of Archivismi!

This article was written on December 29, 2023 by Cassandra

Cassandra Crossing 563/ Archivismi: API, when the going gets tough

Today we will move to a different level of use of the Internet Archive, that of "programming" via API; but first we will have to talk about the duties and responsibilities of Internet Archive users.

In the last two episodes (a full list of the "Archivismi" articles is available as of today) we dealt with elementary archiving on Internet Archive; archiving a single file, however, revealed a significant part of the system and the powerful functions it makes available to us.

Much, much more remains to be shown, even just for manual archiving operations. Soon we will describe and carry out a real archiving campaign, describing the minutiae and small problems that distinguish real cases from the examples we find in manuals.

But today we will deal with a topic already mentioned in passing in a previous episode, one that takes the archival power Internet Archive makes available to its users to a new level. We are obviously talking about the possibility of "programming" operations on Internet Archive.

It doesn't take a genius to imagine that a service like Internet Archive exists because it has a small army of programmers behind it who write, maintain and evolve a base of dedicated software. And incidentally, to stoke the never-extinguished "ranking of the best programming languages" debate: at Internet Archive too, Python rules the roost!

But let's get back to today's topic.

In short: yes, it is possible to use the Internet Archive using scripts or real programs that automate the archiving operations that we decide to carry out.

And yes, this is accomplished by "exposing an API". For the comfort of non-programmers, this simply means that operations can be automated using scripts or actual programs which make, obviously over the Internet, precise calls to elementary Internet Archive functions defined in an API (Application Programming Interface).
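To make "a call to an elementary function" concrete, here is a minimal Python sketch that builds the address of the metadata endpoint for one object and decodes the JSON answer. The function names are ours and purely illustrative; the URL pattern https://archive.org/metadata/<identifier> and the "files" key reflect the public Metadata API as we understand it, and should be checked against the Developer Portal before relying on them.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: the public, read-only metadata endpoint of Internet Archive.
METADATA_ENDPOINT = "https://archive.org/metadata/"


def metadata_url(identifier):
    """Return the metadata URL for an object identifier."""
    return METADATA_ENDPOINT + quote(identifier, safe="")


def fetch_metadata(identifier, timeout=30.0):
    """Make one API call over the Internet and decode the JSON answer."""
    with urllib.request.urlopen(metadata_url(identifier), timeout=timeout) as resp:
        return json.load(resp)


def file_names(record):
    """List the file names found in a metadata record (empty if none)."""
    return [entry["name"] for entry in record.get("files", []) if "name" in entry]
```

Calling file_names(fetch_metadata("some-identifier")) would then list the files of an object; nothing is sent over the network until fetch_metadata is actually called.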

It would be enough to say nothing more, simply provide the link to the Internet Archive Developer Portal again, and let anyone who has ever programmed, even just by writing a .bat script for DOS, discover and use the power of the Internet Archive API.

But no: a few preliminary indications and recommendations are still necessary before giving even a very small example.

First of all, Internet Archive places no predefined limits on what a user can do with the services it provides; for example, it does not limit a priori the amount of information that can be stored.

But nothing exposed to the public can be left "defenseless", given that the world's share of imbeciles, profiteers and criminals is also represented among Internet Archive users.

As the history of the Internet has repeatedly demonstrated, large collaborative entities, for example Wikipedia, can only survive and develop if managed as a hybrid between imperfect democracy and enlightened tyranny. Internet Archive is no exception.

This is why some resources, such as Collections, are rationed and granted only on request. Administrators of various levels supervise and control the functioning and use of the Internet Archive, keep users in line, and rein in or expel the dysfunctional ones. Such a presence should not be seen as a problem or a limit, but as a resource; in fact, the administrators' main role is to help all users use Internet Archive.

However, administrators are a precious and scarce resource; sending them an email, when not directly foreseen by the procedures (for example for the creation of a Collection), must be seen as a last resort, to be used only after careful reading of the documentation and online help, many tests, a search of the blog and, why not, even of normal search engines. Take my word for it!

But weren't we supposed to be programming something? Very true, so let's get straight to practice. To start with something simple and harmless, let's assume that we have found a series of things that interest us, for example several issues of a magazine, and that we want to download them in a fast, reliable way that does not require repetitive manual operations.

And for simplicity's sake, we will do everything from the command line, without directly using the API and therefore without having to write a real program in Python or similar; we just need to download the Python program "ia" and use it. ia is an already "pseudo-compiled" program, that is, one written in a so-called intermediate "language", Python bytecode, which is portable to any platform that has a Python 3 environment installed.

Using a Linux distribution (Debian, Ubuntu, etc.) is highly recommended. You can also run it in a VirtualBox or VMware virtual machine on any computer.

The Windows WSL environment should also work, but here Cassandra goes no further and leaves those brave enough to try it to their own devices; indeed, she awaits their feedback in order to integrate it into this article.

So let's go back with Cassandra to her beloved Debian, and install and configure ia following the procedure we find here. But a simple

sudo apt install internetarchive

is also enough. Debian miracles…

In short, on a computer where a Python 3 environment is installed, we must download the ia command wherever we prefer (or install it), make it executable, and finally launch it with the configure parameter to associate it with our user (you did create your user, right?).

Everything is ready; as a first example, with the following command we can download only the original PDF of our example article, which we uploaded in the last episode.

$ ./ia download cassandra-crossing-2558-il-dizionario-di-cassandra-archivismi --no-directories --format="Text PDF"
cassandra-crossing-2558-il-dizionario-di-cassandra-archivismi:
downloading Cassandra_Crossing_2558_The Dictionary of Cassandra_ Archivismi.pdf: 100%|█| 513k/513k [00:00<00:00, 709kiB/s
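For the "several issues of a magazine" case, the same command can be repeated over a list of identifiers. The following is only a sketch, assuming that ia is installed and on the PATH; the helper names and the identifiers in the usage comment are hypothetical, while the --no-directories and --format options are the ones used in the command above.

```python
import subprocess


def download_command(identifier, file_format=None, no_directories=False, ia="ia"):
    """Build the argument list for one `ia download` invocation."""
    cmd = [ia, "download", identifier]
    if no_directories:
        cmd.append("--no-directories")
    if file_format:
        # Passed as a single argument, so no shell quoting is needed.
        cmd.append(f"--format={file_format}")
    return cmd


def download_all(identifiers, **options):
    """Run `ia download` once per identifier, stopping at the first failure."""
    for identifier in identifiers:
        subprocess.run(download_command(identifier, **options), check=True)


# Hypothetical identifiers, for illustration only:
# download_all(["my-magazine-001", "my-magazine-002"],
#              file_format="Text PDF", no_directories=True)
```

Building the command as a list rather than a single string avoids quoting problems with values that contain spaces, such as "Text PDF".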

But if we had wanted to download the entire object, derivative files included, we could have written even more simply

$ ./ia download cassandra-crossing-2558-il-dizionario-di-cassandra-archivismi

We would thus have obtained a directory with the same name as the object identifier, containing all the files it is made of. The same process also works to download an entire collection, or parts of it. Another recommendation: first work out how large your selection is; Internet Archive holds objects of enormous size.

For help, besides consulting the online guide, just run the commands
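One way to estimate the size of a selection before downloading it is to add up the file sizes listed in the object's metadata. The sketch below assumes a metadata record shaped like the JSON returned by the metadata endpoint, with a "files" list whose entries carry "format" and "size" fields (sizes as strings); that shape is our assumption, and the sample record is invented for illustration.

```python
def selection_size(record, file_format=None):
    """Sum the sizes, in bytes, of the files in a metadata record.

    If file_format is given, only files whose "format" matches are counted.
    """
    total = 0
    for entry in record.get("files", []):
        if file_format is not None and entry.get("format") != file_format:
            continue
        total += int(entry.get("size", 0))  # sizes are assumed to be strings
    return total


def human_readable(n_bytes):
    """Format a byte count using binary units."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


# Invented sample record, not real data.
sample = {"files": [
    {"name": "article.pdf", "format": "Text PDF", "size": "525312"},
    {"name": "article_djvu.txt", "format": "DjVuTXT", "size": "20480"},
    {"name": "article_meta.xml", "format": "Metadata"},
]}

print(human_readable(selection_size(sample, "Text PDF")))  # -> 513.0 KiB
```

Summing over every object of a collection in the same way gives a rough idea of the disk space and time a download campaign will need.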

$ ./ia help
$ ./ia help download
$ ./ia help upload

We end with other recommendations in no particular order.

If you upload new objects, it is better to use the spreadsheet method in CSV format, of which you can find an example here or in the guide. This way you will always have all the parameters under control together. Giving all the parameters from the command line can be complex and mistakes can easily be made.

When you create your objects, always include them in the collection test_collection, as is also shown in the example sheet. We have already explained the reasons.
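As a rough illustration of the spreadsheet method, the sketch below writes a CSV file with one row per object and places every row in test_collection. The column names are our recollection of the example sheet in the ia guide and should be treated as an assumption to verify there; the identifier, file name and title are hypothetical.

```python
import csv
import io

# Assumed column names; check them against the example sheet in the guide.
COLUMNS = ["identifier", "file", "title", "creator", "date", "collection"]


def write_spreadsheet(items, stream, collection="test_collection"):
    """Write one CSV row per item, filling in the collection for each row."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Columns missing from an item are written as empty cells.
        writer.writerow({"collection": collection, **item})


# Hypothetical test item, for illustration only.
buffer = io.StringIO()
write_spreadsheet(
    [{"identifier": "my-test-item", "file": "article.pdf", "title": "A test"}],
    buffer,
)
print(buffer.getvalue())
```

Writing to a real file instead of a StringIO buffer produces a sheet that can be checked by eye, or opened in a spreadsheet program, before it is handed to ia upload.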

When instead you upload your first definitive objects, do not include the collection among the parameters, leaving the default one, opensource. Happy experimenting!

And that's all for today too. Stay tuned for the next episode of "Archivismi".

Marco Calamari

Write to Cassandra — Twitter — Mastodon
Video column “A chat with Cassandra”
Cassandra's Slog (Static Blog).
Cassandra's archive: school, training and thought
