Publishing datasette to Google Cloud Compute with GitHub Actions

Simon Willison has a fascinating data-publishing and data-management project named datasette. A few months ago, he put together a plugin named datasette-ripgrep that uses ripgrep (you use ripgrep, right?) to search folders of files and display the results using datasette’s machinery.

I thought of creating a datasette-ripgrep instance to search all the packages from the Enthought Tool Suite. Using GitHub to search across this cohesive set of tools, and only this set of tools, doesn’t really work.

Setting datasette-ripgrep up locally turned out to be pretty easy. But publishing it to Google Cloud Compute (GCP) using GitHub Actions, so I could automate the daily update of the contents of the indexed repositories, turned out to be a multi-month effort.

I started working off the demo deploy action, which took me most of the way there. But I kept running into GCP authentication issues. It complained that “No credentials provided, skipping authentication”. That is, until I realized, after 2 months of on-and-off attempts, that I was putting GitHub secrets in Settings > Environment > Secrets, and not in Settings > Secrets. *slaps forehead* I’m sure actions can see secrets in the Environment section somehow, but I don’t know how. Another thing I learned is that when the GCP docs ask you to put the service account key in a GitHub secret, you can just paste the whole JSON as-is.

The next hurdle was that the datasette publish cloudrun command would fail with the error “You do not appear to have access to project […]”. I tried many things related to IAM, roles, service accounts and the like, but without success. The ah ha! moment came when I realized/remembered that datasette.publish.cloudrun actually talks to GCP using the gcloud command line tool. I identified that it calls the builds and deploy subcommands. Using that information, I could search for the permissions required to execute those commands. The one I was missing was Cloud Build Editor (and maybe Viewer).

In the end, the Service Account has the following roles (I’m not 100% sure they’re all necessary):

  • Cloud Build Editor
  • Compute Engine Service Agent
  • Service Account User
  • Cloud Run Admin
  • Storage Admin
  • Viewer

After 100 failed deploys and much reading of mediocre Medium articles and of Google’s (seemingly) incomplete and incorrect READMEs, the 101st deploy succeeded! You can now search the ETS repos at the very unglamorous URL of https://datasette-ripgrep-ets-alicuzwd4a-uc.a.run.app and see the source on GitHub.

What does %matplotlib do in IPython?

TLDR; Use %matplotlib if you want interactive plotting with matplotlib. If you’re only interested in the GUI’s event loop, %gui <backend> is sufficient.

I never really understood the difference between %gui and %matplotlib in IPython. One of my colleagues at Enthought once told me that at some point in his career, he more or less stopped reading documentation and instead went straight to the code. That’s what I did here. But let’s do a bit of history first.

In the “beginning”, there was pylab. It (still) is a module of matplotlib and was a flag to IPython designed to facilitate the adoption of Python as a numerical computing language by providing a MATLAB-like syntax.[1] The reference was so explicit that before being renamed to pylab on Dec 9, 2004, the module was called matplotlib.matlab. IPython adopted the rename on the same day.[2] With the --pylab flag or the %pylab magic function, IPython would set up matplotlib for interactive plotting and execute a number of imports from IPython, NumPy and matplotlib. Even though it helped a few people transition to Python (including myself), it turned out to be a pretty bad idea from a usability point of view. Matthias Bussonnier wrote up a good list of the many things that are wrong with it in “No Pylab Thanks.”

For the 1.0.0 release of IPython in August 2013, all mentions of %pylab were removed from the examples (in a July 18, 2013 commit) and were replaced by calls to the %matplotlib magic function, which only enables interactive plotting but does not perform any imports. The %matplotlib function had already been introduced in a 2013 refactoring to separate the interactive plotting from the imports. The %gui magic command had already been introduced in 2009 by Brian Granger to “manage the events loops” (hint hint).

Now we know that the (my) confusion with %gui and %matplotlib started in 2013.

This analysis refers to IPython 7.8.0 and ipykernel 5.1.2.

Our entry point will be the %matplotlib magic command. Its source code is in the IPython/core/magics/pylab.py file. The essential call is to shell.enable_matplotlib(gui), which is itself implemented in IPython.core.interactiveshell.InteractiveShell, and does five things:

  1. Select the “backend” given the choice of GUI event loop. This is done by calling IPython.core.pylabtools.find_gui_and_backend(gui). It encapsulates the logic to go from a GUI name, like "qt5" or "tk", to a backend name, like "Qt5Agg" and "TkAgg".
  2. Activate matplotlib for interactive use by calling IPython.core.pylabtools.activate_matplotlib(backend), which:
    1. Activates the interactive mode with matplotlib.interactive(True);
    2. Switches to the new backend with matplotlib.pyplot.switch_backend(backend);
    3. Replaces the matplotlib.pyplot.draw_if_interactive method with the same method, but wrapped by a flag_calls decorator, which adds a called flag to the method. That flag will be used by the new %run runner that’s introduced below at point #5;
  3. Configure inline figure support by calling IPython.core.pylabtools.configure_inline_support(shell, backend). This is where some very interesting stuff happens. It first checks that InlineBackend is actually importable from ipykernel.pylab.backend_inline, and returns immediately if it isn’t. If it is importable, it does one of two things:
    1. If the backend is "inline", it imports the ipykernel.pylab.backend_inline.flush_figures function and registers it as a callback for the "post_execute" event of the shell. As we’ll see later, callbacks for "post_execute" are called after executing every cell;
    2. If the backend is not "inline", it unregisters the flush_figures callback;
  4. Enable the GUI by calling shell.enable_gui(gui). This method is not implemented in the IPython.core.interactiveshell.InteractiveShell base class, but rather in IPython.terminal.interactiveshell.TerminalInteractiveShell. If a gui is specified, it gets the name of the active_eventloop and its corresponding inputhook function using IPython.terminal.pt_inputhooks.get_inputhook_name_and_func(gui). The active_eventloop is just a string, such as 'qt', but the inputhook is more interesting. It’s the function to call to start that GUI toolkit’s event loop. Let’s dig further into get_inputhook_name_and_func(gui). That function checks a few things, but it essentially:
    1. Imports the correct inputhook function for the chosen GUI by importing it from IPython.terminal.pt_inputhooks.<gui_mod>. For example, the Qt inputhook is imported from IPython.terminal.pt_inputhooks.qt. Later on, when inputhook is executed for Qt, it will:
      1. Create a QCoreApplication;
      2. Create a QEventLoop for that application;
      3. Execute the event loop and register the right events to make sure the loop is shut down properly. The exact operations to start and stop the loop are slightly different for other GUI toolkits, like tk, wx, or osx, but they all essentially do the same thing. At this point we’re ready to go back up the stack to enable_matplotlib in %matplotlib;
  5. Replace IPython’s default_runner with the one defined in IPython.core.pylabtools.mpl_runner. The default_runner is the function that executes code when using the %run magic. The mpl_runner:
    1. Saves the matplotlib.interactive state, and disables it;
    2. Executes the file;
    3. Restores the interactive state;
    4. Makes the rendering call, if the user asked for it, by checking the plt.draw_if_interactive.called flag that was introduced at point #2.3 above.
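
To make points #2.3 and #5 more concrete, here is a minimal sketch of the flag-and-runner idea in plain Python. It is not IPython's code: FakePyplot is a hypothetical stand-in for matplotlib.pyplot, run_script is a made-up name, and the "file" being run is just a callable. It only assumes what the list above describes, namely a decorator that sets a called attribute and a runner that checks it after execution.

```python
import functools


def flag_calls(func):
    """Wrap func so that wrapper.called records whether it has been run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.called = False
        result = func(*args, **kwargs)
        wrapper.called = True
        return result

    wrapper.called = False
    return wrapper


class FakePyplot:
    """Hypothetical stand-in for matplotlib.pyplot, for illustration only."""

    def __init__(self):
        self.interactive = True
        self.drawn = 0
        # Same method, but now carrying a `called` flag.
        self.draw_if_interactive = flag_calls(self._draw_if_interactive)

    def _draw_if_interactive(self):
        # Only redraws when interactive mode is on.
        if self.interactive:
            self.drawn += 1

    def draw(self):
        self.drawn += 1


def run_script(plt, script):
    """Run `script` (a callable here) following the four steps of point #5."""
    was_interactive = plt.interactive
    plt.interactive = False                 # 1. save the state and disable it
    plt.draw_if_interactive.called = False
    try:
        script(plt)                         # 2. execute the "file"
    finally:
        plt.interactive = was_interactive   # 3. restore the state
    if plt.draw_if_interactive.called:      # 4. render only if a plot was requested
        plt.draw()
        plt.draw_if_interactive.called = False
```

The point of the flag is that the script can ask for as many redraws as it wants while interactive mode is off, and the runner performs a single rendering call at the end.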

As for the other magic, %gui, it only executes a subset of what %matplotlib does. It only calls shell.enable_gui(gui), which is point #4 above. This means that if your application requires interaction with a GUI’s event loop, but doesn’t require matplotlib, then it’s sufficient to use %gui. For example, if you’re writing applications using TraitsUI or PyQt.

The Effect of Calling %gui and %matplotlib

Let’s start with the “simplest” one, %gui. If you execute it in a fresh IPython session, it’ll only start the event loop. On macOS, the obvious effect of this is to start the Rocket icon.

Animation of the Python rocket icon starting because of a call to `%gui`.

At that point, if you import matplotlib and call plt.plot(), no figure will appear unless you either call plt.show() afterwards, or manually enable interactive mode with plt.interactive(True).

On the other hand, if you start your session by calling %matplotlib, it’ll start the Rocket and activate matplotlib’s interactive mode. This way, if you call plt.plot(), your figure will show up immediately and your session will not be blocked.

Using %run

If you call %run my_script.py after calling %matplotlib, my_script.py will be executed with the mpl_runner introduced above at point #5.

Executing a Jupyter Notebook Cell When Using the "inline" Backend

In the terminal, the IPython.terminal.interactiveshell.TerminalInteractiveShell.interact() method is where all the fun stuff happens. It prompts you for code, checks if you want to exit, executes the cell with InteractiveShell.run_cell(code), and then triggers the "post_execute" event for which we’ve registered the ipykernel.pylab.backend_inline.flush_figures callback. As you might have noticed, the flush_figures function comes from ipykernel, and not from IPython. It tries to return all the figures produced by the cell as PNG or SVG, displays them on screen using IPython’s display function, and then closes all the figures, so matplotlib doesn’t end up littered with all the figures we’ve ever plotted.
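
The register/trigger mechanism can be sketched with a tiny callback registry. This is a simplified illustration, not the IPython or ipykernel implementation: the Events class, the open_figures and displayed lists, and this flush_figures are all made up for the example. They only mimic the behaviour described above (run the cell, fire "post_execute", display and close the figures).

```python
class Events:
    """Minimal callback registry, loosely modelled on the idea of shell.events."""

    def __init__(self):
        self._callbacks = {"post_execute": []}

    def register(self, event, callback):
        if callback not in self._callbacks[event]:
            self._callbacks[event].append(callback)

    def unregister(self, event, callback):
        self._callbacks[event].remove(callback)

    def trigger(self, event):
        # Iterate over a copy so callbacks may unregister themselves safely.
        for callback in list(self._callbacks[event]):
            callback()


open_figures = []  # hypothetical stand-in for matplotlib's open figures
displayed = []     # hypothetical stand-in for what the frontend shows


def flush_figures():
    """Display every open figure, then close them all."""
    displayed.extend(open_figures)
    open_figures.clear()


def run_cell(events, cell):
    """Execute a cell (a callable here), then fire the post_execute callbacks."""
    cell()
    events.trigger("post_execute")
```

With flush_figures registered, every figure created in a cell ends up in displayed and open_figures is emptied. Once it is unregistered, which is what happens when switching away from the "inline" backend, figures stay open after the cell has run.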

Conclusion

To sum it up, use %matplotlib if you want interactive plotting with matplotlib. If you’re only interested in the GUI’s event loop, %gui <backend> is sufficient. Although, as far as I understand, there’s nothing very wrong with using %matplotlib all the time.


  1. Basically, no namespaces, and direct access to functions like plot, figure, subplot, etc.
  2. The earliest commit I found for the IPython project was on July 6, 2005 by Fernando Perez, 7 months after the name change. Its Git hash is 6f629fcc23ba63342548f61cc7307eeef4f55799. But the earliest mention is an August 2004 entry in the ChangeLog: “ipythonrc-pylab: Add matplotlib support,” which is before the official rename in matplotlib.

Manually Merging Day One Journals

My first Day One entry is from January 24, 2012. I used it often to take notes about what I was doing during my PhD with the #wwid tag (what was I doing, an idea from Brett Terpstra, I think), and sometimes to clarify some thoughts.

When Day One went The Way of the Subscription, I didn’t bother too much because Dropbox sync still worked. Until it didn’t. I somehow didn’t realize it and kept adding entries to both the iOS and the macOS versions. Not good. It’s been on my to do list for a while to find a way to merge the two journals. I could probably subscribe to the Day One sync service and have it figure out the merging, but I didn’t want to subscribe just for that.

I learned somewhere that Day One 2 could export journals as a folder of photos and a JSON file. I figured I could probably write a script to do the merging. So I downloaded Day One 2 on my iPhone and Mac, imported my Day One Classic journals, exported them as JSON to a folder on my Mac, and unzipped them. I also created a merged/ folder to hold the merged journal. The hierarchy looks like this:

$ tree -L 2
.
├── Journal-JSON-ios/
│   ├── Journal.json
│   └── photos/
├── Journal-JSON-ios.zip
├── Journal-JSON-mac/
│   ├── Journal.json
│   └── photos/
├── Journal-JSON-mac.zip
├── merge_journals.py
└── merged/

I first copied the photos folder from Journal-JSON-ios/ to merged/, and then copied the photos from Journal-JSON-mac/photos/ into merged/photos/. I was pretty confident that I would end up with the union of all the photos because Day One uses UUIDs to identify each photo. The -n option to cp prevents overwriting files.

$ cp -r Journal-JSON-ios/photos merged/
$ cp -n Journal-JSON-mac/photos/* merged/photos/
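
For reference, the same "union without overwriting" can be sketched in Python. This is not what I ran, just an equivalent of the two cp commands under the assumption that the photos folders are flat (files only, no subfolders). The copy_missing name is made up for the example.

```python
import shutil
from pathlib import Path


def copy_missing(src_dir, dst_dir):
    """Copy files from src_dir into dst_dir, skipping names that already exist."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.iterdir()):
        dst = dst_dir / src.name
        # Like `cp -n`: never overwrite a file that is already there.
        if src.is_file() and not dst.exists():
            shutil.copy2(src, dst)
            copied.append(src.name)
    return copied
```

Calling copy_missing("Journal-JSON-ios/photos", "merged/photos") and then copy_missing("Journal-JSON-mac/photos", "merged/photos") would give the same result as the commands above, and the returned lists show which files came from which export.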

I then ran the merge_journals.py script (below) to do a similar merge of the entries, based on the UUIDs. The merging happens by building a dictionary with the UUID of each entry as the key and the entry itself as the value. It’s two passes, one over the iOS entries and one over the macOS entries, so when both journals contain the same UUID, the macOS version wins. Entries with the same UUID should have the same contents, unless I’ve edited some metadata on one platform but not the other. I’m not too worried about that.

The output dictionary will be written to the Journal.json file. The entries are sorted chronologically because that’s how it was in the exported journal files, but I doubt it matters.

The output dictionary is written to disk without enforcing the conversion to ASCII since the exported journals are encoded using UTF-8. The indent is there to make the output more readable and diff-able with the exported journals.

import json

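# The exported journals are UTF-8, so read them explicitly as such.
with open('./Journal-JSON-ios/Journal.json', encoding='utf-8') as f:
    ios = json.load(f)
with open('./Journal-JSON-mac/Journal.json', encoding='utf-8') as f:
    mac = json.load(f)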

# Extract and merge UUIDs
uniques = {entry['uuid']: entry for entry in ios['entries']}
for entry in mac['entries']:
    uniques[entry['uuid']] = entry

# Create the output JSON data structure
output = {}
output['metadata'] = mac['metadata']
output['entries'] = list(uniques.values())
# I'm not sure it matters, but Day One usually exports the entries
# in chronological order
output['entries'].sort(key=lambda e: e['creationDate'])

# ensure_ascii=False writes Unicode characters as-is instead of \uXXXX escapes.
with open('merged/Journal.json', 'w', encoding='utf-8') as f:
    json.dump(output, f, indent=True, ensure_ascii=False)
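
To see what the merge does on conflicting entries, here is a toy run of the same logic on made-up data. The entries only contain the keys the script relies on, the UUIDs are shortened to single letters, and the timestamp format is an assumption (ISO 8601 strings, which sort chronologically when compared as text). The merge_entries name is made up for the example.

```python
def merge_entries(first, second):
    """Union of two entry lists keyed by 'uuid'; entries in `second` win ties."""
    merged = {entry["uuid"]: entry for entry in first}
    for entry in second:
        merged[entry["uuid"]] = entry
    return sorted(merged.values(), key=lambda e: e["creationDate"])


# Made-up entries for illustration only.
ios_entries = [
    {"uuid": "A", "creationDate": "2012-01-24T10:00:00Z", "text": "first"},
    {"uuid": "B", "creationDate": "2013-05-01T08:30:00Z", "text": "ios only"},
]
mac_entries = [
    {"uuid": "A", "creationDate": "2012-01-24T10:00:00Z", "text": "first (edited)"},
    {"uuid": "C", "creationDate": "2012-06-15T19:45:00Z", "text": "mac only"},
]

merged = merge_entries(ios_entries, mac_entries)
print([e["uuid"] for e in merged])  # prints ['A', 'C', 'B']
print(merged[0]["text"])            # prints first (edited)
```

Entry A exists in both journals, and the macOS copy is the one that survives because it is inserted last. Swapping the two arguments would make the iOS copy win instead.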

The last step is to zip the journal and photos together, which tripped me up a few times. The Journal.json and the photos/ folder must be at the top level of the archive, so I zip the file from within the merged/ folder and then move it back up one level.

$ cd merged
$ zip -r merged.zip *
$ mv merged.zip ..
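
The same archive layout can be produced from Python, which avoids the cd and mv dance. This is a sketch using the standard library's shutil.make_archive, not what I actually ran. The zip_folder_contents and top_level_names helpers are made-up names. The second one is only there to check that Journal.json and photos/ really sit at the top level.

```python
import shutil
import zipfile
from pathlib import Path


def zip_folder_contents(folder, archive_stem):
    """Zip the contents of `folder` so they sit at the top level of the archive.

    `archive_stem` is the archive path without the .zip extension. It should
    point outside of `folder` so the archive does not end up inside itself.
    """
    # root_dir makes the paths stored in the archive relative to `folder`.
    return shutil.make_archive(str(archive_stem), "zip", root_dir=str(folder))


def top_level_names(archive_path):
    """Return the set of top-level file and folder names in a zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return {name.split("/")[0] for name in zf.namelist()}
```

From the folder that contains merged/, zip_folder_contents("merged", "merged") should create merged.zip next to it, and top_level_names("merged.zip") can confirm the layout before importing it.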

I could then import merged.zip in Day One, which created a new Journal, and delete the old one.

I guess I could somewhat automate this to roll my own, DIY, sync between versions of Day One, but I’d rather pay them money once I decide to use Day One frequently again. Still, I really appreciate that the Day One developers picked formats that could be manipulated so easily.