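[python] Understanding the norm parameter of TfidfVectorizer

TfidfVectorizer(), which is commonly used for tf-idf processing in Python, has a parameter called norm that controls how term vectors are normalized.

To find out what norm actually does, I read through the source code.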

Description of the norm parameter

Here is how the official documentation describes the norm parameter:

norm : ‘l1’, ‘l2’ or None, optional
  Norm used to normalize term vectors. None for no normalization.


In short, norm accepts 'l1', 'l2', or None and can be omitted. It selects the norm used to normalize the term vectors, and passing None turns normalization off.
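
Before reading the source, the effect can be checked from the outside. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the toy corpus and the helper row_lengths() are made up for illustration and are not part of the library.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration).
docs = [
    "apple banana apple",
    "banana cherry",
    "cherry cherry apple banana",
]


def row_lengths(norm):
    """Return the (L1, L2) length of every document vector for a norm setting."""
    X = TfidfVectorizer(norm=norm).fit_transform(docs).toarray()
    return np.abs(X).sum(axis=1), np.sqrt((X * X).sum(axis=1))


l1_of_l1, _ = row_lengths("l1")
_, l2_of_l2 = row_lengths("l2")
_, l2_of_none = row_lengths(None)

print(l1_of_l1)    # each entry is 1 up to rounding
print(l2_of_l2)    # each entry is 1 up to rounding
print(l2_of_none)  # raw tf-idf lengths, not scaled to 1
```

With 'l1' the absolute values of each document vector sum to 1, with 'l2' each vector has Euclidean length 1, and with None the raw tf-idf weights are returned as they are.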

Code reading

The norm parameter is passed to the constructor, so let's look at the constructor first.

sklearn.feature_extraction.text.TfidfVectorizer::ctor()

class TfidfVectorizer(CountVectorizer):
    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):

        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)

        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

Inside the constructor, an instance variable self._tfidf is created and set to an instance of the TfidfTransformer class. The norm parameter is passed on to the TfidfTransformer constructor.

sklearn.feature_extraction.text.TfidfTransformer::ctor()

class TfidfTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf

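Looking at the TfidfTransformer() constructor that was called, norm is again only stored in an instance variable. Searching the class for places that reference norm shows that it is used in transform(), the method that does the actual work, so let's look at that next.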

sklearn.feature_extraction.text.TfidfTransformer::transform()

from ..preprocessing import normalize


class TfidfTransformer(BaseEstimator, TransformerMixin):
    def transform(self, X, copy=True):
        if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.floating):
            # preserve float family dtype
            X = sp.csr_matrix(X, copy=copy)
        else:
            # convert counts or binary occurrences to floats
            X = sp.csr_matrix(X, dtype=np.float64, copy=copy)

        n_samples, n_features = X.shape

        if self.sublinear_tf:
            np.log(X.data, X.data)
            X.data += 1

        if self.use_idf:
            check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')

            expected_n_features = self._idf_diag.shape[0]
            if n_features != expected_n_features:
                raise ValueError("Input has n_features=%d while the model"
                                 " has been trained with n_features=%d" % (
                                     n_features, expected_n_features))
            # *= doesn't work
            X = X * self._idf_diag

        if self.norm:
            X = normalize(X, norm=self.norm, copy=False)

        return X

In transform(), self.norm is used at the very end of the processing. If self.norm is not None (that is, if "if self.norm" evaluates to True), normalize() is called.
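
Note that the check is a plain truthiness test rather than an explicit comparison with None. The sketch below mirrors that test in isolation; would_normalize() is a hypothetical helper written for this post, not a scikit-learn function.

```python
def would_normalize(norm):
    # Mirrors the `if self.norm:` test in transform():
    # any falsy value, not only None, skips the normalize() call.
    return bool(norm)


for value in ("l1", "l2", None, ""):
    print(repr(value), would_normalize(value))
```

So under this code, 'l1' and 'l2' lead to a normalize() call, while None (and, as a side effect of the truthiness test, an empty string) skips it.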

As for where normalize() is defined: the line "from ..preprocessing import normalize" at the top shows that it is actually normalize() from the sklearn.preprocessing package.

sklearn.preprocessing.normalize

from ..utils.extmath import row_norms

def normalize(X, norm='l2', axis=1, copy=True, return_norm=False):
    if norm not in ('l1', 'l2', 'max'):
        raise ValueError("'%s' is not a supported norm" % norm)

    if axis == 0:
        sparse_format = 'csc'
    elif axis == 1:
        sparse_format = 'csr'
    else:
        raise ValueError("'%d' is not a supported axis" % axis)

    X = check_array(X, sparse_format, copy=copy,
                    estimator='the normalize function', dtype=FLOAT_DTYPES)
    if axis == 0:
        X = X.T

    if sparse.issparse(X):
        if return_norm and norm in ('l1', 'l2'):
            raise NotImplementedError("return_norm=True is not implemented "
                                      "for sparse matrices with norm 'l1' "
                                      "or norm 'l2'")
        if norm == 'l1':
            inplace_csr_row_normalize_l1(X)
        elif norm == 'l2':
            inplace_csr_row_normalize_l2(X)
        elif norm == 'max':
            _, norms = min_max_axis(X, 1)
            norms_elementwise = norms.repeat(np.diff(X.indptr))
            mask = norms_elementwise != 0
            X.data[mask] /= norms_elementwise[mask]
    else:
        if norm == 'l1':
            norms = np.abs(X).sum(axis=1)
        elif norm == 'l2':
            norms = row_norms(X)
        elif norm == 'max':
            norms = np.max(X, axis=1)
        norms = _handle_zeros_in_scale(norms, copy=False)
        X /= norms[:, np.newaxis]

    if axis == 0:
        X = X.T

    if return_norm:
        return X, norms
    else:
        return X

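In normalize(), the processing branches on the result of sparse.issparse(X).

This checks whether the argument is a sparse matrix. transform() always converts X to a CSR matrix, so TfidfTransformer actually goes through the sparse branch, but both branches perform the same normalization, so the distinction is not essential here. I will follow the dense (non-sparse) branch, which is easier to read.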

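Extracting just the relevant part gives the following:

        if norm == 'l1':
            norms = np.abs(X).sum(axis=1)
        elif norm == 'l2':
            norms = row_norms(X)
        elif norm == 'max':
            norms = np.max(X, axis=1)

When norm is 'l1', it simply takes the sum of the absolute values of each row, as you can see. Each row of X is then divided by its value in norms at the end of the branch (X /= norms[:, np.newaxis]).

When norm is 'l2', the work is delegated to row_norms(). The line "from ..utils.extmath import row_norms" at the top of the file shows that this is actually sklearn.utils.extmath.row_norms().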

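(Unrelated to this investigation, but normalize() also accepts 'max' in addition to 'l1' and 'l2', so in this version of the code it appears that the norm parameter can be set to 'max' as well.)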
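
To make the dense branch concrete, here is a small re-implementation using only NumPy. It is a sketch for illustration, not the library code: normalize_rows_dense() is a hypothetical name, and the handling of all-zero rows is my assumption about what _handle_zeros_in_scale() does (replace a zero norm with 1 so the row is left unchanged).

```python
import numpy as np


def normalize_rows_dense(X, norm="l2"):
    """Row-wise normalization of a dense 2-D array (illustrative only)."""
    X = np.asarray(X, dtype=np.float64)
    if norm == "l1":
        norms = np.abs(X).sum(axis=1)
    elif norm == "l2":
        norms = np.sqrt((X * X).sum(axis=1))
    elif norm == "max":
        norms = np.max(X, axis=1)
    else:
        raise ValueError("'%s' is not a supported norm" % norm)
    # Assumption: an all-zero row is divided by 1 instead of 0,
    # standing in for _handle_zeros_in_scale().
    norms = np.where(norms == 0.0, 1.0, norms)
    return X / norms[:, np.newaxis]


X = np.array([[3.0, 4.0], [1.0, 1.0], [0.0, 0.0]])
print(normalize_rows_dense(X, "l1"))   # first row becomes [3/7, 4/7]
print(normalize_rows_dense(X, "l2"))   # first row becomes [0.6, 0.8]
print(normalize_rows_dense(X, "max"))  # first row becomes [0.75, 1.0]
```

The only difference between the three settings is which per-row value ends up in norms; the final division is shared.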

sklearn.utils.extmath.row_norms()

def row_norms(X, squared=False):
    """Row-wise (squared) Euclidean norm of X.
    Equivalent to np.sqrt((X * X).sum(axis=1)), but also supports sparse
    matrices and does not create an X.shape-sized temporary.
    Performs no input validation.
    """
    if issparse(X):
        if not isinstance(X, csr_matrix):
            X = csr_matrix(X)
        norms = csr_row_norms(X)
    else:
        norms = np.einsum('ij,ij->i', X, X)

    if not squared:
        np.sqrt(norms, norms)
    return norms

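This is the body of row_norms(). As the docstring says, it performs the same computation as np.sqrt((X * X).sum(axis=1)).

For a dense (non-sparse) matrix with the default squared=False, the two lines that actually run are: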

norms = np.einsum('ij,ij->i', X, X)
np.sqrt(norms, norms)
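
The equivalence stated in the docstring can be checked directly. The sketch below assumes only NumPy; the sample matrix is made up for illustration.

```python
import numpy as np

X = np.array([[3.0, 4.0], [1.0, 2.0], [0.0, 0.0]])

# For every row i, sum X[i, j] * X[i, j] over j (the squared row length).
norms = np.einsum('ij,ij->i', X, X)
# The second argument is the output array, so the square root is taken in place.
np.sqrt(norms, norms)

# The expression given in the row_norms() docstring.
reference = np.sqrt((X * X).sum(axis=1))

print(norms)      # 5, sqrt(5), 0
print(reference)  # same values
```

Both give the Euclidean length of each row, which is the value every row is divided by when norm='l2'.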