[NLP, python3] Counting word frequencies with collections.Counter()

In Python, collections.Counter() makes it easy to build a frequency distribution, that is, to count how many times each element of a list appears.

This comes in handy when you need word frequencies in natural language processing (NLP) and similar tasks.
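As a rough illustration of the NLP use case, the sketch below counts words in a raw sentence. The sample text and the regex-based tokenizer are assumptions made for this example only; real text usually needs a proper tokenizer.

```python
import collections
import re

# Hypothetical sample text, used only for illustration.
text = "The quick brown fox jumps over the lazy dog. The dog barks."

# Naive tokenizer (an assumption): lowercase the text and keep runs of ASCII letters.
tokens = re.findall(r"[a-z]+", text.lower())

# Count how often each token appears.
freq = collections.Counter(tokens)

# The two most frequent words.
print(freq.most_common(2))  # [('the', 3), ('dog', 2)]
```

Lowercasing before counting merges "The" and "the" into one entry, which is usually what you want for word frequencies.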

Program

import collections

# Source words to process
words = ['word1', 'word2', 'word3', 'word1', 'word3', 'word4', 'word1']
print(words)


# Count how many times each word appears
freq = collections.Counter(words)
print("--------------")
print(freq)
print(type(freq))


# List each word with its count
print("--------------")
for key, count in freq.items():
    print(key, count)


# Print the top N most frequent words
print("--------------")
for key, count in freq.most_common(2):
    print(key, count)


# Delete word3, then list the counts again
del freq['word3']
print("--------------")
for key, count in freq.items():
    print(key, count)

Output

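The output is shown below.
Perhaps surprisingly, collections.Counter() returns an object of the collections.Counter class rather than a list. Counter is a subclass of dict, though, so you can iterate over it just as you would a regular collection and rarely need to think about the difference.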
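To make the point about the return type concrete, here is a minimal sketch of how a Counter behaves compared with a plain dict. The word list is made up for the example.

```python
import collections

freq = collections.Counter(['word1', 'word2', 'word1'])

# Counter is a subclass of dict, so ordinary dict operations work.
print(isinstance(freq, dict))  # True
print(freq['word1'])           # 2

# Unlike a plain dict, looking up a missing key returns 0 instead of raising KeyError.
print(freq['unknown'])         # 0

# Counts can be extended incrementally with more words.
freq.update(['word2', 'word3'])
print(sorted(freq.items()))    # [('word1', 2), ('word2', 2), ('word3', 1)]
```

The zero default for missing keys means you can look up any word without checking for its presence first.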

$ python count.py
['word1', 'word2', 'word3', 'word1', 'word3', 'word4', 'word1']
--------------
Counter({'word1': 3, 'word3': 2, 'word2': 1, 'word4': 1})
<class 'collections.Counter'>
--------------
word1 3
word2 1
word3 2
word4 1
--------------
word1 3
word3 2
--------------
word1 3
word2 1
word4 1
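Note that items() lists words in the order they were first seen, while most_common() sorts by count. As a small sketch using the same word list, calling most_common() with no argument returns the full ranking. On Python 3.7 and later, words with equal counts keep their first-seen order.

```python
import collections

words = ['word1', 'word2', 'word3', 'word1', 'word3', 'word4', 'word1']
freq = collections.Counter(words)

# With no argument, most_common() returns every word, highest count first.
ranked = freq.most_common()
print(ranked)  # [('word1', 3), ('word3', 2), ('word2', 1), ('word4', 1)]
```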
