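Computing document distances with a topic model

Data analysis with topic models

Topic models were originally proposed as a method for analyzing document data. The underlying math involves plenty of formulas for maximum likelihood estimation, MAP estimation, and Bayesian inference, but if you just want to try one out quickly, you can use a Python library.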

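LDA (Latent Dirichlet Allocation)

See Wikipedia for details. As I understand it, LDA takes an input (a document, a purchase history, or any other vector of counts) and infers topics from the words that make up that input. "Latent" means that the topics are not directly observable from either the input or the words; they are hidden values that have to be inferred. For example, from the words "lawyer," "law," and "trial," a topic like "judiciary" emerges. My understanding may well be off here too, so corrections from people who know the subject are welcome.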
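To make the "latent" part concrete, here is a minimal sketch of the generative story that LDA assumes: each document draws its own topic mixture, and each word is drawn from one of the topics. The vocabulary, the two topic-word distributions, and the function name generate_document are all made up for illustration; fitting LDA means going the other way and inferring these hidden quantities from the observed words.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and topic-word distributions (illustration only)
vocab = ["lawyer", "law", "trial", "goal", "match", "team"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "judiciary"-like topic
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],  # a "sports"-like topic
])

def generate_document(n_words, alpha=0.5):
    """Sample one document following the LDA generative process."""
    n_topics = len(topic_word)
    # Per-document topic mixture, drawn from a Dirichlet prior
    theta = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)            # latent topic of this word
        w = rng.choice(len(vocab), p=topic_word[z])  # word drawn from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(theta, words)
```

Only the words are observed in practice; theta and the per-word topic assignments are the latent values.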

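LDA in Python

When this was written, scikit-learn did not implement this LDA, so we use the gensim package. (Newer scikit-learn releases do provide sklearn.decomposition.LatentDirichletAllocation. What scikit-learn abbreviates as LDA is Linear Discriminant Analysis, a classifier, which is apparently something entirely different.)

Usage is as follows.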

from gensim import corpora, models, similarities

corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')
model = models.ldamodel.LdaModel(corpus, num_topics=100, id2word=corpus.id2word)

corpus holds the input document data. The data passed as the first argument looks like this:

186 0:1 6144:1 3586:2 3:1 4:1 1541:1 8:1 10:1 3927:1 12:7 4621:1 527:1 9232:1 1112:2 20:1 2587:1 6172:1 10269:2 37:1 42:1 3117:1 1582:1 1585:3 435:1 9268:3 571:2 60:1 61:1 63:2 64:2 5185:1 11:1 4683:1 590:2 1103:2 592:1 5718:1 1623:2 1624:4 89:2 6234:1 8802:1 1638:1 103:1 600:1 9404:1 106:1 3691:1 720:1 2672:1 113:1 2165:1 5751:1 123:3 1148:1 128:2 1670:2 4231:1 1167:1 144:1 147:1 149:7 3735:2 5272:2 1732:1 673:2 5282:1 27:1 1700:1 9893:2 166:1 167:1 173:1 174:1 2224:1 2248:1 372:2 186:1 4284:3 3450:2 117:2 203:1 2244:1 5320:1 201:1 4215:1 9932:2 207:2 208:5 8914:1 7898:1 733:2 1760:1 1744:1 744:1 234:1 1259:2 4287:1 7254:1 249:1 8311:1 5884:2 298:1 254:1 767:2 2304:1 4876:1 270:1 557:1 786:1 789:2 2331:1 287:1 5409:1 290:1 5923:1 2854:1 1834:3 303:1 3888:4 817:2 9523:1 334:1 1333:1 311:2 1855:1 1417:1 325:1 1870:7 1361:1 1362:1 6995:1 342:1 343:1 344:1 857:1 5469:2 351:5 1377:1 2402:1 487:1 884:1 885:1 890:1 4477:1 3455:1 1410:1 5099:1 4489:1 395:1 2570:1 152:1 404:1 1429:1 1430:1 3992:1 416:1 3491:1 2033:1 3499:1 429:1 3502:1 5040:1 433:2 1971:4 437:1 9667:2 322:1 7119:1 8656:1 1102:1 985:1 989:1 1840:1 2529:1 997:1 2022:2 4071:1 2536:1 10219:1 1517:1 1009:1 221:1 3059:1 500:1 511:1

The words of each document are represented as a vector like this. Here is an example of an actual document:

A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said. ``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible class was meeting. 
The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.

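In this format (the LDA-C format used by Blei's code), the leading 186 is the number of distinct terms in the document, not the total word count, and 0:1 means "the word with ID 0 appears once."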
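As a rough illustration of the format, the following sketch parses one such line by hand. The function name parse_ldac_line is made up for this example, and gensim's BleiCorpus does the real parsing for you; the sample input is a shortened line built from a few of the pairs shown above.

```python
def parse_ldac_line(line):
    """Parse one corpus line into (n_terms, {word_id: count})."""
    fields = line.split()
    n_terms = int(fields[0])  # leading number on the line
    counts = {}
    for field in fields[1:]:
        word_id, count = field.split(":")
        counts[int(word_id)] = int(count)
    return n_terms, counts

# Shortened line for illustration: 4 distinct terms
n_terms, counts = parse_ldac_line("4 0:1 6144:1 3586:2 12:7")
print(n_terms, counts)
```

Summing the counts gives the total number of word occurrences in the document, which is generally larger than the leading number.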

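There is one such line per input document. The second argument is the vocabulary file that the word IDs refer to: one word per line, with the line position serving as the ID.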

i
new
percent
people
year
two
million
...

and so on.
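The link between the two files can be sketched as below, assuming that IDs are 0-based line positions in the vocabulary file (the sample data does contain an ID 0). The list vocab_lines holds only the first few words shown above, and pairs holds a few ID:count entries taken from the sample document line.

```python
# First lines of the vocabulary file; line position = word ID (assumed 0-based)
vocab_lines = ["i", "new", "percent", "people", "year", "two", "million"]
id2word = dict(enumerate(vocab_lines))

# A few (word_id, count) pairs from the sample document line
pairs = [(0, 1), (3, 1), (4, 1)]

# Replace each ID with the word it stands for
decoded = [(id2word[word_id], count) for word_id, count in pairs]
print(decoded)  # [('i', 1), ('people', 1), ('year', 1)]
```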

An actual topic looks like this:

In [87]: model.print_topic(0)
Out[87]: u'0.006*year + 0.004*dixon + 0.004*people + 0.004*government + 0.004*i + 0.004*percent + 0.004*two + 0.004*new + 0.003*million + 0.003*united'

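Measuring document similarity with LDA

Once LDA has reduced each document to a vector of topic weights, the similarity between documents can be computed as follows.

First, store each document's topic weights in a NumPy matrix.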

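import numpy as np
from scipy.spatial import distance

# Topic weights of each document as sparse (topic_id, weight) pairs
topics = [model[c] for c in corpus]

# Dense matrix: one row per document, one column per topic (num_topics=100)
dense = np.zeros((len(topics), 100), float)
for ti, t in enumerate(topics):
    for tj, v in t:
        dense[ti, tj] = v

# Euclidean distance between every pair of documents
pairwise = distance.squareform(distance.pdist(dense))

# Set the diagonal to max + 1 so that a document is never its own nearest neighbor
largest = pairwise.max()
for ti in range(len(topics)):
    pairwise[ti, ti] = largest + 1

# Return the ID of the document closest to doc_id
def closest_to(doc_id):
    return pairwise[doc_id].argmin()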

pairwise now holds the distances between documents.
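The nearest-neighbor step can be checked on a toy matrix without training a model. This is only a sketch: the 4 x 3 matrix of topic weights is invented, and the distances are computed with plain NumPy broadcasting, which gives the same Euclidean distances that pdist uses by default.

```python
import numpy as np

# Hypothetical topic weights: 4 documents x 3 topics
dense = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# Euclidean distance between every pair of rows
diff = dense[:, None, :] - dense[None, :, :]
pairwise = np.sqrt((diff ** 2).sum(axis=2))

# Mask the diagonal so a document is never its own nearest neighbor
np.fill_diagonal(pairwise, pairwise.max() + 1)

# Index of the closest other document for each row
nearest = pairwise.argmin(axis=1)
print(nearest)  # [1 0 3 2]
```

Documents 0 and 1 share a dominant topic, as do documents 2 and 3, so each pair ends up as mutual nearest neighbors. Without the diagonal mask, every document would be reported as closest to itself at distance 0.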

The document closest to the one shown earlier is this:

 The nation's biggest junk bond underwriter has had a change of heart about its practice of letting employees buy into its own deals, one week after being accused in Congress of favoring them over its customers. Drexel Burnham Lambert Inc. announced Wednesday that employees and partnerships formed by employees will no longer be permitted to purchase new issues of bonds underwritten by the firm. The investigations subcommittee of the House Energy and Commerce Committee, in hearings last week, released documents showing Drexel employees reaped huge profits by purchasing bonds underwritten by the firm and then quickly reselling them, in one case within 17 days. Employees were permitted to buy bonds during initial offerings of much-sought issues, even when the practice denied bonds to public customers, the documents showed. Drexel was instrumental in developing the market for the high-yield, high-risk junk bonds, which allow companies without a track record of earnings to obtain financing based on whether they have sufficient cash flow to pay off their debts. Junk bonds have been frequently used in corporate raiding. Of the $32 billion in junk bonds issued in 1986, 40 percent were underwritten by Drexel, according to a congressional report. Drexel chief executive Frederick Joseph, in a letter to Rep. John Dingell, D-Mich., chairman of both the committee and subcommittee, stuck by his assertions that the employee purchases were proper, but said the firm was worried about appearances. ``I indicated we supported such purchases; nonetheless public perception is important to financial institutions such as ours and we recognize that even the best of motives can be misunderstood,'' he said. Rep. Ron Wyden, D-Ore., a member of the subcommittee, said Drexel's letter leaves a number of questions unanswered, including whether Drexel will be able to trade for its own accounts and pass profits on to employees in the form of bonuses. 
He said he would also like to know if employee accounts will be able to trade in the secondary market, which is heavily influenced by the underwriting firm. Steven Anreder, a spokesman for Drexel, said the firm had no comment beyond its brief announcement. Dingell, in a statement released by an aide, called Drexel's decision ``a very good step,'' adding, ``It should lay to rest questions associated with the fairness and legality of that particular practice.'' However, the aide, Dennis Fitzgibbons, said Dingell has made no decision regarding his committee's ongoing investigation of Drexel. Dingell has been criticized by Rep. Thomas Bliley, R-Va., for subpoenaing Michael Milken, chief of Drexel's junk bond unit, to appear before the committee even though Milken had indicated in advance he would refuse to testify based on his constitutional protection from self-incrimination. When he refused to testify, Milken confirmed he is under investigation by a federal grand jury in New York, which is reported to be looking into insider trading allegations. Dingell said the firm's retreat on employee purchases ``indicates that the behavior of the committee was correct and fair in bringing these matters to light.'' Junk bonds pay higher interest rates than those issued by established companies and are considered riskier investments. They have been used by corporate raiders to finance unfriendly takeover attempts.

References

実践 機械学習システム
