[python3] Reading a Wikipedia page without garbled characters

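This post shows how to fetch a Wikipedia page in a python3.x environment and print its HTML to the screen without garbled characters.

Here we try cchardet, a package installable with pip that detects character encodings automatically.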
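As background, garbled output appears when bytes are decoded with a codec other than the one that produced them. The snippet below is a minimal sketch using only the standard library; the helper name `garble` and the choice of Shift_JIS as the wrong codec are assumptions made for illustration and are not part of the original script.

```python
def garble(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode it with a different codec."""
    raw = text.encode("utf-8")
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return raw.decode(wrong_codec, errors="replace")


text = "日本語"
wrong = garble(text, "shift_jis")          # decoded with the wrong codec
right = text.encode("utf-8").decode("utf-8")  # decoded with the right codec

print(repr(wrong))
print(repr(right))
```

Only the second value round-trips to the original string, which is why the encoding has to be known, or detected, before calling decode().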

Installing the package

pip3 install cchardet

A script that reads the Wikipedia HTML and prints it

The script uses cchardet's detect() function to determine the character encoding.

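import urllib.request
import cchardet

url = 'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
with urllib.request.urlopen(url) as res:
    # Read the response body as raw bytes.
    html = res.read()
    # detect() returns a dict; 'encoding' is None when detection fails,
    # so fall back to UTF-8 in that case.
    encoding = cchardet.detect(html)['encoding'] or 'utf-8'
    print(html.decode(encoding))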
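A possible variation, not the approach used in this post, is to trust the charset declared in the HTTP Content-Type header when there is one and run detection only as a fallback. The sketch below assumes that ordering is acceptable; the names `is_known_codec` and `decode_body` are hypothetical helpers, and the detector is passed in as a function so the logic can be tried without a network connection.

```python
import codecs
from typing import Callable, Optional


def is_known_codec(name: Optional[str]) -> bool:
    """Return True if Python has a codec registered under this name."""
    if not name:
        return False
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def decode_body(raw: bytes,
                declared: Optional[str],
                detect: Callable[[bytes], Optional[str]],
                default: str = "utf-8") -> str:
    """Decode raw bytes, preferring the declared charset over detection."""
    if is_known_codec(declared):
        encoding = declared
    else:
        # No usable declared charset: ask the detector, then fall back.
        guessed = detect(raw)
        encoding = guessed if is_known_codec(guessed) else default
    return raw.decode(encoding, errors="replace")


# Usage with the script above (requires network access and cchardet):
#   with urllib.request.urlopen(url) as res:
#       declared = res.headers.get_content_charset()
#       text = decode_body(res.read(), declared,
#                          lambda b: cchardet.detect(b)["encoding"])
```

Passing the detector as an argument keeps cchardet optional here; any function that maps bytes to an encoding name, or to None, can be used in its place.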

Result

As shown below, the output is printed to the screen without garbled characters.

python sample.py | grep "<title>"
<title>日本語 - Wikipedia</title>
