0005: OCW Video Resources
Abstract
Parse the course_embedded_media section of OCW parsed.json files and search for YouTube videos within that section. For each video found, add it as a resource, including the thumbnail image and transcript text.
Acceptance Criteria
- Each YouTube video found for an OCW course will be saved as a ContentFile resource in the database.
- The resource will include a thumbnail image and transcript text if found.
- The OpenSearch index will include the resource along with the thumbnail and text.
Architecture Changes
The ContentFile model will need an additional nullable field, image_src, to store the value of the video thumbnail if it exists. The existing content attribute can be used to hold the transcript text. The current ETL process for OCW resources will need to be modified to process the course_embedded_media section:
- YouTube videos in that section will have a title and id of “Video-YouTube-Stream”, with a media_location equal to the YouTube video id.
- Once a YouTube video is identified, check for the existence of a Video object with the same video_id.
  - If a matching Video object is found, copy that video’s image_src and transcript values to the resource’s image_src and content fields.
  - If a matching Video object is NOT found:
    - Find the course_embedded_media object with id of “Thumbnail-YouTube-JPG” and use its media_location as the image_src value.
    - If the transcript attribute of the object exists, use that (stripped of HTML via BeautifulSoup) for the content value.
    - If the content is still empty/null, find the course_embedded_media object with id of “.pdf”. Use tika to retrieve it (via the `technical_location` url), parse the contents, and save them to the `content` attribute.
- The ContentFile will have a content_type of “video” and a file_type of “youtube”.
- If no YouTube video object is found, ignore the section and move on.
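The decision flow above can be sketched in plain Python. This is an illustrative outline only, not the actual ETL code. It assumes each course_embedded_media entry is a dict with an optional transcript string and an embedded_media list of items carrying id, media_location and technical_location. The names build_video_resource, known_videos and fetch_pdf_text are hypothetical: known_videos stands in for the Video lookup by video_id, and fetch_pdf_text stands in for the tika retrieval step. A standard-library parser replaces BeautifulSoup so the sketch has no dependencies, and matching the PDF item by an id ending in ".pdf" is also an assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; a stdlib stand-in for BeautifulSoup here."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def find_item(items, item_id):
    """Return the first embedded media item with the given id, or None."""
    return next((item for item in items if item.get("id") == item_id), None)


def build_video_resource(media, known_videos, fetch_pdf_text=None):
    """Build the field values for one video resource, or None if no video.

    media: one course_embedded_media entry (assumed shape, see lead-in).
    known_videos: hypothetical mapping of video_id to a dict with
        image_src and transcript, standing in for existing Video objects.
    fetch_pdf_text: hypothetical callable(url) returning extracted text,
        standing in for the tika step.
    """
    items = media.get("embedded_media", [])
    stream = find_item(items, "Video-YouTube-Stream")
    if stream is None or not stream.get("media_location"):
        return None  # no YouTube video: ignore this entry and move on

    video_id = stream["media_location"]
    existing = known_videos.get(video_id)
    if existing is not None:
        # Reuse the values already stored on the matching Video.
        image_src = existing.get("image_src")
        content = existing.get("transcript")
    else:
        thumb = find_item(items, "Thumbnail-YouTube-JPG")
        image_src = thumb.get("media_location") if thumb else None
        transcript = media.get("transcript")
        content = strip_html(transcript) if transcript else None
        if not content and fetch_pdf_text is not None:
            # Assumption: the transcript PDF is the item whose id ends in ".pdf".
            pdf = next(
                (i for i in items if str(i.get("id", "")).endswith(".pdf")),
                None,
            )
            if pdf and pdf.get("technical_location"):
                content = fetch_pdf_text(pdf["technical_location"])

    return {
        "video_id": video_id,
        "image_src": image_src,
        "content": content,
        "content_type": "video",
        "file_type": "youtube",
    }
```

Passing the Video lookup and the PDF fetcher in as arguments keeps the branching testable without a database or a tika server. In the real pipeline these would presumably be a model query and a tika call.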
Security Considerations
None
Testing & Rollout
After running the get_ocw_files function, there should be a new ContentFile object for each YouTube video found. These new resources should also be in the OpenSearch index with populated image_src and content fields.
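A rollout check along these lines could flag video resources that reached the index without a thumbnail or text. This is a minimal sketch, assuming the indexed documents have already been fetched into plain dicts keyed by the field names used in this RFC. The helper name find_incomplete_video_docs is hypothetical and no OpenSearch client calls are shown.

```python
def find_incomplete_video_docs(docs):
    """Return YouTube video documents missing image_src or content.

    docs: iterable of dicts fetched from the index by some other means
    (assumed shape; field names follow this RFC).
    """
    required = ("image_src", "content")
    return [
        doc
        for doc in docs
        if doc.get("content_type") == "video"
        and doc.get("file_type") == "youtube"
        and any(not doc.get(field) for field in required)
    ]
```

Since the acceptance criteria only require a thumbnail and transcript when they are found, a non-empty result is a prompt to inspect the source data rather than proof of a failure.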