Establish a New Schema
We want to establish a new schema so that we have a clean slate to work with and because there will be some incompatible changes with the current schema. An example of this is the Post.post_id
column is a reference to the reddit id, but in the future needs to be the primary key of the row (see Reddit Ids). We can’t do that in-place with the current table.
Channels
Channels should have the fields carried over with the following adjustments:
channel_id
->id
Posts
Posts should have the fields carried over with the following adjustments:
post_id
->id
reports_ignored = BooleanField(default=False)
- Flag to indicate whether we’re ignoring subsequent reports of the post
- See “Moderation” for more details
text
/preview_text
/plain_text
- We currently have several representations of textual post content, the 3 fields/properties listed above:
text
is the markdown text for text postspreview_text
is precomputed plain text representation of link posts, sourced from and Embedly API request for that linkplain_text
is a@property
, not a db column, but its logic is as follows:- If the post is an article, render the article to plain text
- If the post is a text post, render the markdown to plain text
- Otherwise use
preview_text
- See also: https://github.com/mitodl/open-discussions/blob/9a96f09ab4c8c617da3ba8842adb9c419c5d321d/docs/rfcs/0002-indexing-embedly-link-preview.md#4-related-but-separate-issue-getting-rid-of-either-the-plain_text-or-text-field-in-es-documents
- We should to normalize
preview_text
/plain_text
down to a single stored field so we’re not recomputing it every time a read occurs. - We still need to keep the
text
field and thearticle
relation
- We currently have several representations of textual post content, the 3 fields/properties listed above:
Comments
Comments should have the fields carried over with the following adjustments:
channel_id
->id
reports_ignored = BooleanField(default=False)
- Flag to indicate whether we’re ignoring subsequent reports of the comment
- See “Moderation” for more details
Comment Tree
Prior art:
- SQL Antipatterns (“Naive Trees”):http://www.r-5.org/files/books/computers/languages/sql/style/Bill_Karwin-SQL_Antipatterns-EN.pdf
- Models for Hierarchical Data: https://www.slideshare.net/billkarwin/models-for-hierarchical-data
- Django treebeard: https://django-treebeard.readthedocs.io/en/latest/
Reddit stores comment trees in Cassandra, which is overkill for our current use cases. That said, we need to be able to migrate the comments in such a manner that we achieve the following:
- Maintain current API result structures
- Specifically, we need to support some form of reddit’s “more comments” pagination. This does not need to partition the comments exactly as reddit would, the interface just needs to be compatible with how the frontend traverses it.
- Support currently used sorting methods (best, old, new)
- Support rendering an arbitrary subtree of the comments for the comment detail view
From the Prior Art listed above, Materialized Paths are probably our best bet and we’re already familiar to some degree with them because they’re what wagtail
uses under the hood for page hierarchies.
The big thing here is that in order to support the sorts we need one table/model for each type of sort. That precludes using a single table for everything but it simplifies reads because we can pick our sorting data model and then prefetch_related
on the Comment
table so that code can work generically over similarly structured tables. This approach means these sorting hierarchy tables end up being a sorted lookup table for extracting a portion of the comment tree.
An example of what the data model would look like (working branch here):
from django.db import models
from treebeard.mp_tree import MP_Node
class Post(models.Model):
pass
class Comment(models.Model):
post = models.ForeignKey(Post, on_delete=models.CASCADE)
class CommentTreeNode(MP_Node):
# specialize the alphabet for postgres
# see: https://django-treebeard.readthedocs.io/en/latest/mp_tree.html#treebeard.mp_tree.MP_Node.alphabet
alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
# a post can have many comments, but a reverse lookup into this tree isn't pragmatic
post = models.ForeignKey(
Post, on_delete=models.CASCADE, related_name="+"
)
class Meta:
abstract = True
class NewCommentTreeNode(CommentTreeNode):
comment = models.OneToOneField(
Comment,
on_delete=models.CASCADE,
unique=True,
null=True,
related_name="new_comment_node",
)
created = models.DateTimeField()
node_order_by = ["-created"]
class OldCommentTreeNode(CommentTreeNode):
comment = models.OneToOneField(
Comment,
on_delete=models.CASCADE,
unique=True,
null=True,
related_name="old_comment_node",
)
created = models.DateTimeField()
node_order_by = ["created"]
class BestCommentTreeNode(CommentTreeNode):
comment = models.OneToOneField(
Comment,
on_delete=models.CASCADE,
unique=True,
null=True,
related_name="best_comment_node",
)
score = models.IntegerField()
node_order_by = ["-score"]
When a sorted column changes, it requires a move of the node. For the Old/New sorts those will never change. For the Best sort, it will change whenever a vote is made or updated. Fortunately, Materialized Paths are pretty efficient at this. An example with the above data models is:
from django.db import transaction
post = Post.objects.create()
comment1 = Comment.objects.create(post=post)
comment2 = Comment.objects.create(post=post)
comment3 = Comment.objects.create(post=post)
@transaction.atomic
def rescore(node, new_score):
node.score = score
node.save()
node.move(node)
# comment trees start with a root node w/ `comment=None`
# this would get created when the post is created
root = BestCommentTreeNode.add_root(post=post, score=0)
# top level comment
c1n = root.add_child(comment=comment1, score=1)
# replies to comment1
c2n = c1n.add_child(comment=comment1, score=4)
c3n = c1n.add_child(comment=comment1, score=5)
# c3n should list above c2n
print(root.dump_bulk())
rescore(c2n, 8)
# c2n is now sorted above c3n and has the updated score
print(root.dump_bulk())
This results 4 queries, two SELECT ... LIMIT 1
and two UPDATE ...
queries:
UPDATE "discussions_bestcommenttreenode" SET "path" = '000100010002', "depth" = 3, "numchild" = 0, "post_id" = 2, "comment_id" = 3, "score" = 8 WHERE "discussions_bestcommenttreenode"."id" = 5;
SELECT "discussions_bestcommenttreenode"."id", "discussions_bestcommenttreenode"."path", "discussions_bestcommenttreenode"."depth", "discussions_bestcommenttreenode"."numchild", "discussions_bestcommenttreenode"."post_id", "discussions_bestcommenttreenode"."comment_id", "discussions_bestcommenttreenode"."score" FROM "discussions_bestcommenttreenode" WHERE ("discussions_bestcommenttreenode"."depth" = 3 AND "discussions_bestcommenttreenode"."path" BETWEEN '000100010000' AND '00010001zzzz' AND "discussions_bestcommenttreenode"."score" > 8) ORDER BY "discussions_bestcommenttreenode"."path" ASC LIMIT 1;
SELECT "discussions_bestcommenttreenode"."id", "discussions_bestcommenttreenode"."path", "discussions_bestcommenttreenode"."depth", "discussions_bestcommenttreenode"."numchild", "discussions_bestcommenttreenode"."post_id", "discussions_bestcommenttreenode"."comment_id", "discussions_bestcommenttreenode"."score" FROM "discussions_bestcommenttreenode" WHERE ("discussions_bestcommenttreenode"."depth" = 3 AND "discussions_bestcommenttreenode"."path" BETWEEN '000100010000' AND '00010001zzzz') ORDER BY "discussions_bestcommenttreenode"."path" DESC LIMIT 1;
UPDATE "discussions_bestcommenttreenode" SET path='000100010005'||SUBSTR(path, 13) WHERE path LIKE '000100010002%';
Votes
We should have a table for PostVote
and CommentVote
, these should have the following columns:
comment = ForeignKey(Comment)
/post = ForeignKey(Post)
- item that was voted onvoter = ForeignKey(User)
- the user who made the votedirection = SmallIntegerField()
- we store either a1
or-1
to indicate which direction the user voted. In practice,PostVote
will always have adirection == 1
, but it’s useful to carve this out now in case that changes at a later date. It’s also useful to be able to run aSUM()
over that column.
Votes are probably the trickiest moving piece in this reddit migration because it’s impractical to get the data directly from the reddit backend for a backpopulate. This probably means we’re going to need either some kind of ETL process. This is only a 1-time event, so an approach as basic as an export/import of a CSV would be adequate.
Reported Content
We need a data model for reported content (e.g. report post/comment to a moderator).
We need to support the following functionality:
- User can submit a report for a given post or comment
- One record per post or comment is sufficient and likely desired since the moderator is only going to act on a post or comment once - additional records would superfluous.
- We don’t need to store the reason the user enters for the report
- Moderators can act on the report, the available actions are:
- Approve: report is marked as resolved, post/comment remains visible
- Ignore: report is marked as resolved, post/comment remains visible, subsequent reports are suppressed
- Remove: report is marked as resolved, post/comment is removed and no longer visible, subsequent reports are suppressed
We don’t need much data here, the following is sufficient:
GenericForeignKey
toPost
/Comment
- we should only create a new report if there is either no existing report or if an existing report is not resolved
- We can accomplish this with a partial unique index combined with a
get_or_create
withresolved_on=None
:constraints = [UniqueConstraint( fields=["content_type", "content_id"], condition=Q(resolved_on=None) )]
resolved_on = DateTimeField(null=True)
- when a report is resolved, this is set the the current datetime, nullable datetimes have useful properties that we might want to take advantage of down the road
- Additionally, if the action is to remove the item, that side effect occurs at this time as well
resolved_by = ForeignKey(User)
- probably useful to know who resolved it- No need to pull in historial reports from reddit