[python] Scraping the subtitle scripts of the SAO TV series (English version) (Sword Art Online)

The site animetranscripts.wikispaces.com hosts subtitle script text for anime that has been released in English, written and shared by volunteers.

This time, partly as programming practice, I scraped all of that text and wrote it out to text files.

Program

Here is the program I wrote.

import os
import requests
from bs4 import BeautifulSoup

baseUrl = 'http://animetranscripts.wikispaces.com'


# Get the URLs of the pages that hold each episode's text
def getUrlList(urlToc):
    result = []
    response = requests.get(urlToc)

    soup = BeautifulSoup(response.content, 'html.parser')
    articleUrls = soup.find('div', id='content_view').find_all('a', class_='wiki_link')
    for url in articleUrls:
        result.append(baseUrl + url['href'])

    return result 

# Get the transcript text from one episode page
def getTranscript(url):
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.select_one('#content_view').text
    return text


# Save the given text to a file
def saveToFile(text, fileName):
    # Write as UTF-8 explicitly; the with-block closes the file even on errors
    with open(fileName, 'w', encoding='utf-8') as f:
        f.write(text)


# Main routine
def main(tocPage):
    if not os.path.exists('output'):
        os.mkdir('output')

    # Get the list of episode URLs
    urlToc = baseUrl + tocPage
    urlList = getUrlList(urlToc)

    for index, url in enumerate(urlList):
        # Show progress (0-based index and URL)
        print(index, url)

        # Get the English script text
        text = getTranscript(url)

        # Save the retrieved text to a file (file names are 1-based)
        fileName = "output/text" + str(index+1) + '.txt'
        saveToFile(text, fileName)

main('/Sword+Art+Online')
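
The link-extraction step in getUrlList can be tried without touching the network. The following is a minimal sketch, assuming the table-of-contents page has the shape the program relies on (episode links are `a.wiki_link` elements inside `div#content_view`). The HTML sample and the name `extract_episode_urls` are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

BASE_URL = 'http://animetranscripts.wikispaces.com'

# Hypothetical, simplified stand-in for the table-of-contents page.
SAMPLE_TOC_HTML = """
<html><body>
  <a class="wiki_link" href="/Home">Home</a>
  <div id="content_view">
    <a class="wiki_link" href="/Sword%20Art%20Online%3E1.%20The%20World%20of%20Swords">1. The World of Swords</a>
    <a class="wiki_link" href="/Sword%20Art%20Online%3E2.%20Beater">2. Beater</a>
    <a href="http://example.com/">external link without the class</a>
  </div>
</body></html>
"""


def extract_episode_urls(html, base_url=BASE_URL):
    """Return absolute URLs for every a.wiki_link inside div#content_view."""
    soup = BeautifulSoup(html, 'html.parser')
    content = soup.find('div', id='content_view')
    if content is None:
        # The page does not have the expected layout; return nothing
        # instead of raising AttributeError on None.
        return []
    links = content.find_all('a', class_='wiki_link')
    return [base_url + a['href'] for a in links]


episode_urls = extract_episode_urls(SAMPLE_TOC_HTML)
for url in episode_urls:
    print(url)
```

Only the two links inside `#content_view` are returned; the navigation link outside it and the link without the `wiki_link` class are skipped. The `None` check is an addition of this sketch so that an unexpected page layout yields an empty list rather than an exception.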

Results

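When run, the program prints a 0-based index and the URL of the episode it is currently processing.
At the time I ran it, data for SAO seasons 1 and 2 was available, so I got text for 50 episodes in total.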

$ python scrape_sao.py
0 http://animetranscripts.wikispaces.com/Sword%20Art%20Online%3E1.%20The%20World%20of%20Swords
1 http://animetranscripts.wikispaces.com/Sword%20Art%20Online%3E2.%20Beater
2 http://animetranscripts.wikispaces.com/Sword%20Art%20Online%3E3.%20Red-Nosed%20Reindeer
3 http://animetranscripts.wikispaces.com/Sword%20Art%20Online%3E4.%20The%20Black%20Swordsman
....
48 http://animetranscripts.wikispaces.com/Sword%20Art%20Online%20II%3E023.Beginning%20of%20a%20Dream
49 http://animetranscripts.wikispaces.com/Sword%20Art%20Online%20II%3E024.Mother%27s%20Rosario
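
The URLs in this output are percent-encoded (`%20` is a space, `%3E` is `>`, `%27` is an apostrophe). As a small sketch, the series name and episode title could be recovered from such a URL with the standard library. The helper name `episode_title_from_url` is hypothetical, and splitting on `>` is an assumption based only on the URLs shown above.

```python
from urllib.parse import unquote, urlsplit


def episode_title_from_url(url):
    """Split a page URL into (series, episode) using the '>' separator."""
    # Take only the path part and drop the leading slash.
    path = urlsplit(url).path.lstrip('/')
    # Turn %20, %3E, %27, ... back into the characters they stand for.
    decoded = unquote(path)
    series, sep, episode = decoded.partition('>')
    if not sep:
        # No '>' in the page name: treat the whole name as the series.
        return decoded, ''
    return series, episode.strip()


url = ('http://animetranscripts.wikispaces.com/'
       'Sword%20Art%20Online%20II%3E024.Mother%27s%20Rosario')
series, episode = episode_title_from_url(url)
print(series)   # Sword Art Online II
print(episode)  # 024.Mother's Rosario
```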

Checking the output directory, all 50 files had been created.

$ ls output/
text1.txt       text14.txt      text19.txt      text23.txt      text28.txt      text32.txt      text37.txt      text41.txt      text46.txt      text50.txt
text10.txt      text15.txt      text2.txt       text24.txt      text29.txt      text33.txt      text38.txt      text42.txt      text47.txt      text6.txt
text11.txt      text16.txt      text20.txt      text25.txt      text3.txt       text34.txt      text39.txt      text43.txt      text48.txt      text7.txt
text12.txt      text17.txt      text21.txt      text26.txt      text30.txt      text35.txt      text4.txt       text44.txt      text49.txt      text8.txt
text13.txt      text18.txt      text22.txt      text27.txt      text31.txt      text36.txt      text40.txt      text45.txt      text5.txt       text9.txt
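
Note that `ls` lists the names in lexicographic order, so text10.txt comes before text2.txt. If the files need to be processed in episode order later, one option is to sort by the number embedded in the name. This is only a sketch; the helper name `episode_number` is hypothetical and the list of names is a short excerpt for illustration.

```python
import re


def episode_number(file_name):
    """Return the first run of digits in the name as an int (-1 if none)."""
    match = re.search(r'\d+', file_name)
    return int(match.group()) if match else -1


# A few names in the lexicographic order that ls shows them in.
names = ['text1.txt', 'text10.txt', 'text11.txt',
         'text2.txt', 'text20.txt', 'text3.txt']

ordered = sorted(names, key=episode_number)
print(ordered)
# ['text1.txt', 'text2.txt', 'text3.txt', 'text10.txt', 'text11.txt', 'text20.txt']
```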

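I opened one file at random, and the text appears to have been retrieved correctly.
I did not check every file, but the structure appears to be a speaker name followed by that speaker's lines.
Because of how the source page is laid out, only the speaker on the first line (Rosalia: in the example below) has extra leading whitespace.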

$ head -25 output/text4.txt

  Rosalia:
What are you talking about?
Why should we have to share these healing crystals with you?
Your pet lizard has healing powers, you know.

Silica:
But you never fight on the front line.
So, who are you to talk?
You don't need to use any crystals!

Rosalia:
Of course I do.
You're the popular one around here, Silica.
I can't expect the boys to help me when I need it.

Guy:
Oh, come on. It's not like that.

Silica:
Fine! I don't care.
You want my items, take 'em.
But I'm never going to team up with you again!
I know, there is a ton of other parties out there who'd love to have me on their side
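
The speaker-then-lines structure seen above could be turned into structured data roughly as follows. This is a sketch under the assumption that a speaker is always a line ending in a colon; a line of dialogue that happens to end with a colon would be misread as a speaker. The function name `parse_transcript` is hypothetical, and the sample is a shortened excerpt of the output above.

```python
SAMPLE = """
  Rosalia:
What are you talking about?
Why should we have to share these healing crystals with you?

Silica:
But you never fight on the front line.
So, who are you to talk?

Guy:
Oh, come on. It's not like that.
"""


def parse_transcript(text):
    """Return a list of (speaker, [lines]) tuples in order of appearance."""
    blocks = []
    speaker = None
    lines = []
    for raw in text.splitlines():
        # strip() also removes the stray leading spaces on the first speaker.
        line = raw.strip()
        if not line:
            continue
        if line.endswith(':'):
            # A new speaker starts; store the previous block first.
            if speaker is not None:
                blocks.append((speaker, lines))
            speaker = line[:-1]
            lines = []
        elif speaker is not None:
            lines.append(line)
        # Text before the first speaker line is ignored.
    if speaker is not None:
        blocks.append((speaker, lines))
    return blocks


blocks = parse_transcript(SAMPLE)
for name, spoken in blocks:
    print(name, len(spoken))
```

Because every line is stripped before it is inspected, the extra whitespace in front of the first speaker does not need special handling.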
