---
title: Recovering deleted Wagtail pages and Django models
class: text-center
highlighter: shiki
transition: slide-left
mdc: true
themeConfig:
  primary: '#fd5765'
---
# Recovering [deleted]{style="color: #fd5765"} Wagtail pages and/or Django models
### Jake Howard{.mt-10}
<ul class="list-none! text-sm [&>li]:m-0! mt-1 uppercase">
<li>Senior Systems Engineer @ Torchbox</li>
<li>Core Team, Security Team & Performance Working Group @ Wagtail</li>
</ul>
<ul class="list-none! text-sm [&>li]:m-0! mt-3">
<li><mdi-earth /> theorangeone.net</li>
<li><mdi-twitter /> @RealOrangeOne</li>
<li><mdi-github /> @RealOrangeOne</li>
<li><mdi-mastodon /> @jake@theorangeone.net</li>
</ul>
---
layout: cover
background: /intranet.png
---
# Setting the scene
<!--
- People usually use Wagtail as a website or blog
- But it works really well as an intranet too
- At Torchbox, we use it for internal documentation ("intranet")
  - Processes
  - Company information
  - Links to other places etc
- Been around for a while
- In 2022, we restructured the content
  - Make it easier to find things
  - Remove duplication
- This didn't quite go to plan
- One afternoon, I was looking to reference a process, and couldn't find it
- Turns out, the entire "Sysadmin" section had completely vanished
-->
---
layout: cover
background: /site-history.png
---
# Site history report
<!--
- First step: Understanding what happened
- The site history report!
- Fortunately, Wagtail showed _almost_ exactly what had happened, and what I expected
- One staff member deleted the "Sysadmin" section a few days before
- Which deleted every page under it, all 105 of them
- "Radical reorganisation"
-->
---
layout: image
image: /chat.png
backgroundSize: contain
---
<!--
- I messaged the person, to better understand what happened
- Assuming they didn't mean to delete all that content
- Hanlon's Razor
- They'd made a new "Sysadmin" section a while ago, before switching strategy to move pages in the existing tree
- They then deleted the wrong one
- Sure, Wagtail shows a confirmation before deleting pages, but when you're deleting a lot of pages and expecting the prompt, it's easy to skim past the message
- With the content gone, I had to restore from backups.
-->
---
layout: section
---
# Restoring from backups
<!--
- Our intranet is a living document, it gets updated fairly often
- Rolling back the entire system by almost 2 days would have meant potentially losing critical changes
- Not to mention the time people spent making those changes
- It'd be annoying. We _could_ do it, but I'd rather find another solution
-->
---
layout: section
---
# _Partially_ restoring from backups
<!--
- Ideally, what I needed was to restore only the sysadmin pages, leaving all others completely untouched.
- Using a few tricks of Django and Wagtail internals, it's absolutely possible, and we did it
- With 0 downtime, too!
-->
---
layout: section
---
## 1.
# Spin up a database backup
<!--
- We back up our intranet nightly, so I downloaded a backup from before the incident
- Started the codebase locally against that backup so I could interrogate it
-->
---
layout: section
---
## 2.
# Locate the page models
<div class="pt-5 text-left">

```python
from wagtail.models import Page
sysadmin_page = Page.objects.get(id=91)
child_pages = sysadmin_page.get_descendants()
```
</div>
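Why `get_descendants()` can find the whole section in one query: treebeard's materialised-path trees store each node's position as a fixed-width `path` string, so a node's descendants are the rows whose path starts with its own. The sketch below imitates that lookup in plain Python. The paths are invented for illustration, and this is not treebeard's actual code.

```python
STEPLEN = 4  # characters per tree level (treebeard's default step length)


def depth(path):
    """Depth of a node, derived from the length of its path."""
    return len(path) // STEPLEN


def get_descendants(paths, parent_path):
    """Return every path strictly below parent_path, in tree order."""
    return sorted(
        p for p in paths
        if p.startswith(parent_path) and len(p) > len(parent_path)
    )


# Hypothetical tree: root > home > sysadmin (two children), plus a sibling section
paths = [
    "0001",
    "00010001",
    "000100010003",
    "0001000100030001",
    "0001000100030002",
    "000100010004",
]

sysadmin = "000100010003"
descendants = get_descendants(paths, sysadmin)
```

The sibling section `000100010004` is left out because its path does not start with the sysadmin path.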
<!--
- Behind the scenes, Wagtail pages are a tree, implemented using `django-treebeard`.
- When a page is deleted, treebeard finds all the descendant pages and deletes them too
- And then Django and postgres deal with cascading the delete
-->
---
layout: section
---
## 3.
# Locate what was deleted
<div class="pt-5 text-left">

```python
from django.contrib.admin.utils import NestedObjects
collector = NestedObjects(using="default")  # the database alias is a required argument
collector.collect(list(child_pages) + [sysadmin_page])
```
</div>
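Conceptually, the collector walks outward from the objects being deleted and gathers everything that points at them. Here is a toy version of that walk in plain Python, assuming a made-up `dependents` mapping in place of real model relations. Django's collector also honours each relation's `on_delete` rule, which this sketch ignores.

```python
from collections import deque


def collect_cascade(roots, dependents):
    """Return the roots plus everything that references them, directly or indirectly."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical object graph: each key lists the rows that reference it
dependents = {
    "page:91": ["revision:500", "revision:501", "blogpage:91"],
    "blogpage:91": ["blogpage_tags:7"],
    "page:12": ["revision:10"],
}

to_delete = collect_cascade(["page:91"], dependents)
```

Nothing is deleted here either: the result is only the set of rows a delete would take with it.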
<!--
- This is where the magic happens
- Deleting a page deletes more than just a page
  - The specific model
  - Revisions
  - Related models
  - Through tables
- `get_descendants` won't get all those
- Calling `.delete` gives you the number of objects, and it's quite a lot
- If you've ever used the Django admin, you know it's capable of finding every model instance before a delete
- That's implemented with an undocumented but simple to use API
- Yes, that's really it. It doesn't delete the models, it just tells us what _would_ be deleted if we triggered a delete.
-->
---
layout: section
---
## 4.
# Serialize
<div class="pt-5 text-left">

```python {all|3-5}
from django.core.serializers.json import Serializer

class NoM2MSerializer(Serializer):
    def handle_m2m_field(self, obj, field):
        pass  # through-table rows are serialized separately

def get_model_instances():
    for qs in collector.data.values():
        yield from qs

with open("deleted-models.json", "w") as f:
    NoM2MSerializer().serialize(
        get_model_instances(),
        stream=f
    )
```
</div>
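For reference, each entry in a Django JSON fixture is an object with `model`, `pk` and `fields` keys. The plain-Python sketch below builds one such entry and drops an inlined many-to-many field, which is the effect `NoM2MSerializer` has. The model label, field names and values are all invented for illustration.

```python
import json


def to_fixture_record(model_label, pk, fields, m2m_fields=()):
    """Build one fixture entry, leaving out any inlined many-to-many fields."""
    kept = {name: value for name, value in fields.items() if name not in m2m_fields}
    return {"model": model_label, "pk": pk, "fields": kept}


record = to_fixture_record(
    "intranet.processpage",  # hypothetical model label
    92,
    {"title": "Rotating credentials", "tags": [3, 7]},
    m2m_fields={"tags"},  # would otherwise be inlined as a list of primary keys
)

fixture = json.dumps([record], indent=2)
```

The tag links are not lost: they travel as their own through-table entries elsewhere in the same file.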
<style>
pre.shiki {
font-size: 0.8rem !important;
line-height: 18px;
}
</style>
<!--
- `collector.data` now contains all the model instances which were deleted, in memory on my laptop
- My laptop isn't what's running production
- Need to serialize the models into an intermediary format which can then be loaded onto production
- If you're thinking of fixtures, you're right
- Django's fixtures create a JSON representation of a model, so they can be saved in 1 location and loaded into another
- Mostly useful for complex test fixtures (hence the name), but generally useful for cases like this
- [click]`NoM2MSerializer` is a bit special
- When Django serializes a model with an M2M field which doesn't use a custom through table, it inlines the related keys, because that's easier to work with
- However, `NestedObjects` still finds the through-table rows, so `loaddata` tries to load them separately as well
- Resulting in duplicate objects and referential integrity issues
- Instead, we exclude them
-->
---
layout: section
---
## 4a.
# [De]{.italic}serialize
### `manage.py loaddata`
<!--
- We have a JSON file, the inverse is just `manage.py loaddata`
-->
---
layout: center
---
# `restore-deleted-pages.py`
```python {all}{lines:true}
from django.contrib.admin.utils import NestedObjects
from django.core.serializers.json import Serializer
from wagtail.models import Page

class NoM2MSerializer(Serializer):
    def handle_m2m_field(self, obj, field):
        pass  # through-table rows are serialized separately

# Run against the restored backup, where the deleted pages still exist
sysadmin_page = Page.objects.get(id=91)
child_pages = sysadmin_page.get_descendants()

# Collect everything a delete would remove, without deleting anything
collector = NestedObjects(using="default")
collector.collect(list(child_pages) + [sysadmin_page])

def get_model_instances():
    for qs in collector.data.values():
        yield from qs

with open("deleted-models.json", "w") as f:
    NoM2MSerializer().serialize(
        get_model_instances(),
        stream=f
    )
```
<style>
pre.shiki {
font-size: 0.7rem !important;
line-height: 17px;
}
</style>
<!--
- If we combine it all together, this is the big script we end up with
-->
---
layout: fact
---
### 5.
# **Test!**
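One way to skip models that don't restore cleanly is to filter them out while iterating the collected data. This is a minimal sketch, not the exact code used for the talk: `"wagtailsearch.indexentry"` is an assumed label for the search index model, and the dictionary below is a stand-in for `collector.data`, keyed by label strings rather than model classes.

```python
def iter_instances(data, skip_labels, label_of):
    """Yield collected instances, skipping any model whose label is excluded."""
    for model, instances in data.items():
        if label_of(model) in skip_labels:
            continue
        yield from instances


SKIP = {"wagtailsearch.indexentry"}  # assumed label for the search index rows

# Stand-in for collector.data
data = {
    "wagtailcore.page": ["page 91", "page 92"],
    "wagtailsearch.indexentry": ["index entry 1"],
}

kept = list(iter_instances(data, SKIP, label_of=str))
```

With real model classes as keys, `label_of` could be `lambda model: model._meta.label_lower`.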
<!--
- For what I hope are obvious reasons, this needed to be tested!
- I deleted the pages through the Wagtail admin locally, then restored them and confirmed they were all the same
- I'm glad I did, because there was an issue: Search indexes
- The search index objects (we use postgres) were picked up by `NestedObjects`
- They didn't like being restored
- So I skipped them and moved on, knowing I'd just rebuild the index later.
- `manage.py fixtree` also reports any tree issues, which there weren't
-->
---
layout: image-right
image: /red-button.png
class: flex justify-center flex-col items-center
---
### 6.
# Showtime!
<v-clicks>

1. Backup! ✅
2. Send `deleted-models.json` to server ✅
3. `loaddata`
4. `fixtree`
5. `update_index`
6. `rebuild_references_index`

</v-clicks>
<!--
- The tense bit
- Once I was happy, I ran the same steps on production
- Our intranet runs on Heroku, so I had to do a few dances to get the JSON file up there.
- [click]Before I began, I did a backup, because I'm a good sysadmin
- [click]With the data file in place, [click]I crossed everything and ran `loaddata`
- Pages popped up in the admin as if they never left
- [click]`fixtree` worked.
- [click]`update_index` worked.
- [click]As did `rebuild_references_index`
- The new pages were now findable
-->
---
layout: cover
background: /sysadmin.png
---
# Conclusion
<!--
- With a few hours' work, the pages were back
- There was no downtime
- No content freeze
- No data loss
- Most people didn't even know there was an issue
- I've used this trick a few times in my career, for both Wagtail and plain Django sites
- Ironically, just a few weeks after the blog post was published
- Works identically for Django sites, so long as you know how to reconstruct the delete query.
- Hopefully this helps you out as much as it has me!
-->
---
layout: end
---
https://wagtail.org/blog/recovering-deleted-pages-and-models/