0002: Indexing Link Previews

Table of Contents

Abstract

Read the description of #1811 for a description of the problem we’re solving here. This RFC is to address some decisions about how we go about indexing content from Embedly responses and the index structure in general. The goal of this RFC is to call attention to some of the finer points of implementation.

Architecture

Using a more descriptive Embedly API endpoint

Here’s a direct comparison between two Embedly endpoints using the same URL:

extract potentially gives us much more text content to index in the content field than oembed does in the description field (this isn’t always true though).

Proposal: Use the extract endpoint instead of oembed; if content exists, strip the HTML and save that as the link preview text since it’s more complete/descriptive; if content doesn’t exist, use the description value.

Drawbacks: Larger responses both from Embedly and from our own API (since ES documents for link posts will be larger), but that’s very unlikely to have any significant impact on performance. Text/article posts are already liable to have a similar amount of raw content.

Here’s a partial document from an embedly response:

{
  "content": "<div><p>NPY acts by sticking to receptor proteins, and the pharmaceutical industry ...",
  "title": "A New Way to Keep Mosquitoes From Biting",
  "provider_name": "The Atlantic",
  ...
}

The title and provider_name are potentially significant in terms of the search-ability and relatability of these document (example: if I search “Atlantic”, it makes some sense to include this link post from the publication called “The Atlantic” in the results).

Proposal: Prepend provider_name and title to the text that we index for link posts, and separate the values by dashes (et al). Using the example above, the text that we index would look like this: "The Atlantic - A New Way to Keep Mosquitoes From Biting - NPY acts by sticking to receptor proteins, and ..."

Proposal: Save the preview text for a link post in the channels.models.Post.text field (we wouldn’t attempt to save text for the corresponding reddit object)

Potential Issues: There might still be some application logic on the back- or front-end that considers a post with non-blank text value to be a self/text post. This would break that logic.

After reviewing this a bit, it became clear that we don’t seem to be using the ES documents’ text field. We only care about the plain text content of these posts (stored in the plain_text field).

Proposal: (a) Get rid of the text field in ES documents, or (b) get rid of plain_text, and instead store only plain text content in the text field.

Drawbacks: For (a) there may still be some application logic that makes use of that value. For (b) it could be jarring to have two properties with the same name on the model object and its corresponding ES document with different values.

If this makes sense, this change would be made in a separate issue (i.e.: not as part of #1811)