Archiving YouTube videos by subtitled language
Intro
youtube-dl allows users to easily archive vast amounts of YouTube videos with minimal input on the user’s part. However, channels in a foreign language that offer subtitles on select videos present a challenge - how does one only download videos that are subtitled in their desired language?
The situation
Best Motoring International is the YouTube channel of the Japanese publisher Best Motoring. Since 1987, they have been covering tuner car culture in Japan via magazines and home videos, first on VHS, then DVD. Considered Japan’s preeminent motoring publisher throughout the 90s and 00s, Best Motoring offers an important look at Japan’s car culture during this time. Only a small portion of their catalog was released in English until the past few years, when they began translating and adding subtitles to volumes old and new through their YouTube channel.
The channel has 1,400 videos (as of July 24, 2022), and the vast majority of them do not have English subs. To archive only the subtitled videos along with their subtitles, we must somehow create a list of videos that have English subtitles.
The battleplan
We’re going to scan the channel and download the subtitles from videos that have manually created English subtitles. Unfortunately, while the subtitle options filter which subtitles are downloaded, they do not filter the videos: a normal run would still download every video regardless of what subtitles it has. So we’ll first download only the subtitles, create a list of video IDs from the subtitle files, then feed that list back into our software to download just the videos with English subtitles.
Step 0: Prepare our folder and download tools
I will be using Windows but these techniques and tools can often be applied to other OSes.
1. Let’s create a project folder at the root of one of your hard drives for simplicity. I’m calling mine BMI and placing it in my D: drive.
2. Create two empty folders in that directory called sub and videos.
3. Download our tools:
a. We’ll need a copy of either youtube-dl or yt-dlp. At the time of writing, youtube-dl is severely speed limited, and you must use the fork yt-dlp to make full use of your connection speed (80kb/sec vs 20mb/sec in my case). The commands in this guide are the same for both programs.
b. ffmpeg.
c. To send commands to these applications, I will be using PowerShell (install here if needed).
My directory D:/BMI now has yt-dlp.exe and ffmpeg.exe (which I grabbed from the bin folder of the ffmpeg archive) and the two empty folders.
To quickly open a PowerShell terminal in your desired folder, right click on the BMI folder (or on a blank space while in the folder) and click Open PowerShell Here (W10) or Open in Windows Terminal (W11).
Step 1: What language subtitle are we downloading?
90% of the subtitled videos on the Best Motoring International channel are filed under English (en), but a few are filed under English-United States (en-US), so I had to add that to the list of languages. The language is chosen manually by the channel creators, who may not have been perfectly consistent. For example, one might need to add en-GB or en-CA to this list, depending on what the creator selected.
If you want to see the languages a video has subtitles for and the corresponding language code, the option --list-subs will list manually created subs along with automatic captions in two separate categories. This command’s output is very long, so I recommend running it only on one video. Example command:
./yt-dlp --skip-download --list-subs https://www.youtube.com/watch?v=cUVrgmKS_KA
While it will show every language under the sun, what we are looking for is at the bottom of the output, which shows subtitles that were created by the user and whether they labeled them en, en-US, en-GB, or en-CA.
[info] Available subtitles for cUVrgmKS_KA:
Language Name Formats
en English vtt, ttml, srv3, srv2, srv1, json3
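If you would rather not read the list by eye, the language codes can be pulled out of the saved output with a few lines of Python. This is only a sketch: it assumes the section header and column layout look exactly like the three lines shown above, and the function name manual_sub_languages is made up for this example.

```python
def manual_sub_languages(output):
    """Return the language codes listed under '[info] Available subtitles for ...'."""
    codes = []
    in_section = False
    for line in output.splitlines():
        if line.startswith("[info] Available subtitles for"):
            in_section = True          # start of the manually created subtitles
            continue
        if line.startswith("[info]"):
            in_section = False         # any other [info] header ends the section
            continue
        if not in_section or not line.strip():
            continue
        first_column = line.split()[0]
        if first_column == "Language":  # skip the column header row
            continue
        codes.append(first_column)
    return codes


# The sample text is copied from the output shown in this guide.
sample = """[info] Available subtitles for cUVrgmKS_KA:
Language Name Formats
en English vtt, ttml, srv3, srv2, srv1, json3
"""
print(manual_sub_languages(sample))  # ['en']
```

If the tool’s output format changes, the header check is the part most likely to need adjusting.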
Step 2: Create our command to scan the channel
Here are our goals for this command. We want to:
A. Scan all the videos of the Best Motoring International YouTube channel (https://www.youtube.com/c/besmo/videos)
B. Download subtitles for videos only if they have English subtitles. (--write-sub --sub-lang en,en-US)
C. Not download the videos, for now. (--skip-download)
D. Name each subtitle file after its video ID and place it in the subfolder sub. (-o 'sub/%(id)s.%(ext)s')
./yt-dlp --skip-download --write-sub --sub-lang en,en-US https://www.youtube.com/c/besmo/videos -o 'sub/%(id)s.%(ext)s'
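The -o template uses the same %(name)s placeholder syntax as Python’s %-style string formatting, so a rough illustration of how it gets filled in is possible in plain Python. This is not how the tool is implemented, only a sketch; the helper name expand_template and the metadata values are assumptions for the example. For subtitle files the tool also places the language code before the extension, which is why the files end in .en.vtt.

```python
def expand_template(template, fields):
    """Fill a %(name)s style template from a dict of field values."""
    return template % fields


# Hypothetical metadata for a single video.
fields = {"id": "cUVrgmKS_KA", "ext": "vtt"}
print(expand_template("sub/%(id)s.%(ext)s", fields))  # sub/cUVrgmKS_KA.vtt
```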
Step 3: Test and run the command
To paste into PowerShell, just right click anywhere. We’re going to run our test command above and take a look at the output and the subtitle files it is downloading. Rather than waiting until it’s finished processing to spot potential issues, we’re going to cancel it once everything looks as expected by pressing Ctrl+C (yes, our favorite copy command does something different on the command line).
Don’t cancel it too early, though - wait until it says [download] Downloading video 1 of 1400 and has downloaded a few subtitle files (shown by lines such as [info] Writing video subtitles to: XXX.en.vtt).
Take a look at what it’s downloading (or not) and make tweaks as necessary for your situation. If all looks good, re-run the command (pressing the up arrow on the keyboard automatically fills in the most recently run command) and let it finish processing. And before we move on, let’s back up our command for future reference in a .txt file (open Notepad, paste the command, and save the file as scan.txt in our project directory).
Step 4: Create a list of video IDs that have subtitles
After scanning the channel and downloading only the English subtitles from the channel creator, we have narrowed 1,400 videos to 185 videos.
By default, youtube-dl’s filename includes the title of the video, but with our command, the files only have the video ID and the file extension. This allows us to easily create a list of the video IDs of the videos we want to download.
1. Head into your sub folder and select all files by pressing Ctrl+A. Right click and select Copy as path.
2. Open Notepad and paste.
3. Select the first part of each line that includes the directory (in my case "D:\BMI\sub\, including the opening quotation mark) and copy it. Press Ctrl+H, which opens Find & Replace. Paste the copied text into Find and leave Replace blank. Click Replace All.
4. Now type the file extension into Find (in my case .en.vtt", including the closing quotation mark) and again leave Replace blank. Click Replace All. Run again as needed for the language variations (in my case .en-US.vtt").
You should now have a clean list of video IDs with no extra characters or lines. Scroll down the file and glance to confirm this. I’m going to save this in my BMI directory and call it good.txt.
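The same cleanup can be scripted instead of done by hand in Notepad. The sketch below assumes the subtitle files are named ID.LANGUAGE.vtt as produced by the scan command; the function names are made up for this example, and the second ID in the sample is invented purely to show the en-US case.

```python
import os


def ids_from_filenames(names, langs=("en", "en-US"), ext="vtt"):
    """Strip the .<lang>.<ext> ending from each name and return unique IDs, sorted."""
    ids = set()
    for name in names:
        for lang in langs:
            suffix = "." + lang + "." + ext
            if name.endswith(suffix):
                ids.add(name[: -len(suffix)])
                break  # one match per file is enough
    return sorted(ids)


def write_good_list(sub_dir="sub", out_file="good.txt"):
    """Read the sub folder and write one video ID per line to out_file."""
    ids = ids_from_filenames(os.listdir(sub_dir))
    with open(out_file, "w", encoding="utf-8") as fh:
        fh.write("\n".join(ids) + "\n")
    return ids


# Small demonstration on file names only (no folder needed).
demo = ["cUVrgmKS_KA.en.vtt", "abc123XYZ_-.en-US.vtt", "notes.txt"]
print(ids_from_filenames(demo))
```

Calling write_good_list() from the project folder would produce good.txt directly. Files that do not match one of the listed languages are ignored, and a video with both en and en-US subtitles is listed only once.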
Step 5: Download only the videos that have subtitles
We need to build another PowerShell command, this time to input our final list of videos and download them. This time we want to:
A. Download only videos on our good list. (-a good.txt)
B. Merge the output into a single MKV file. (--merge-output-format "mkv")
C. Download the subtitle files in our desired languages, same as above. (--write-sub --sub-lang en,en-US)
D. Output our videos into the videos subfolder and title the videos with their YouTube title. (-o "videos\%(title)s.%(ext)s")
D (alt). Depending on the channel, you may want to have the date in the file name. Use this in place of the above option if so. (-o "videos\%(upload_date)s-%(title)s.%(ext)s")
E. Download in the best possible quality. (-f bestvideo+bestaudio/best)
./yt-dlp -a good.txt --merge-output-format "mkv" --write-sub --sub-lang en,en-US -o "videos\%(title)s.%(ext)s" -f bestvideo+bestaudio/best
Run it and confirm that what’s happening is what you expect. You should see .mkv video files and .vtt subtitles appear in the videos subdirectory.
Once youtube-dl finishes running through your list of videos, you are all set! Back up this command too in a text file called download.txt.
Step 6 (optional): Checking for new subtitles
The cool thing about doing it this way is that as long as we keep our sub directory intact, we have a built-in system to check for videos that have had subtitles added since we ran our command (weeks, months, or years later).
- Rename the scan.txt file to scan.bat and download.txt to download.bat so you can run these scripts just by double-clicking them. Batch files are run by the Command Prompt rather than PowerShell, so edit the commands first: change ./yt-dlp to yt-dlp, use double quotes instead of single quotes, and double every percent sign in the output template (for example, "sub/%%(id)s.%%(ext)s").
- Run scan.bat to scan for new subtitles. It will not overwrite the existing subtitle files in the sub directory; it only downloads subtitle files it doesn’t already have, so sorting the directory by Date Modified makes it easy to select only the new subtitles.
- Follow the rest of Step 4 (Create a list of video IDs that have subtitles) above to update your good.txt file. Back up the original good.txt first by renaming it originalvideolist.txt.
- Check for a youtube-dl update at the official website and replace the exe in your directory.
- Run download.bat to download the new videos.
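As an alternative to sorting by Date Modified, the new IDs can be found by comparing the old list against a fresh one. The sketch below assumes both lists hold one video ID per line; the function names are made up for this example, and the third ID in the demonstration is invented.

```python
def read_id_list(path):
    """Read one video ID per line, skipping blanks."""
    # utf-8-sig also accepts files that Notepad saved with a byte order mark.
    with open(path, encoding="utf-8-sig") as fh:
        return [line.strip() for line in fh if line.strip()]


def new_video_ids(previous_ids, current_ids):
    """Return IDs in current_ids that are not in previous_ids, keeping their order."""
    seen = set(previous_ids)
    fresh = []
    for video_id in current_ids:
        if video_id not in seen:
            seen.add(video_id)      # also guards against duplicates in current_ids
            fresh.append(video_id)
    return fresh


# Demonstration with in-memory lists.
old = ["cUVrgmKS_KA"]
now = ["cUVrgmKS_KA", "abc123XYZ_-"]
print(new_video_ids(old, now))  # ['abc123XYZ_-']
```

With real files, read_id_list("originalvideolist.txt") would supply the old list, and the result can be saved as the new good.txt so that only the newly subtitled videos are downloaded.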