Problem description:

I am trying to adapt this code (source found here) to iterate through a directory of files instead of using hard-coded input.

#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals

import math

from textblob import TextBlob as tb


def tf(word, blob):
    return blob.words.count(word) / len(blob.words)


def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)


def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))


def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)


document1 = tb("""Today, the weather is 30 degrees in Celcius. It is really hot""")
document2 = tb("""I can't believe the traffic headed to the beach. It is really a circus out there.'""")
document3 = tb("""There are so many tolls on this road. I recommend taking the interstate.""")

bloblist = [document1, document2, document3]

for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100
        print("\t{}, {}".format(word, round(score_weight, 5)))

I would like to use the txt files in a directory as input, rather than hard-coding each document.

For instance, imagine I had a directory foo which contains three files file1, file2, file3.

File 1 contains the contents that document1 contains, i.e.

file1:

Today, the weather is 30 degrees in Celcius. It is really hot

File 2 contains the contents that document2 contains, i.e.

I can't believe the traffic headed to the beach. It is really a circus out there.

File 3 contains the contents that document3 contains, i.e.

There are so many tolls on this road. I recommend taking the interstate.

I thought of using glob to achieve this and came up with the following adaptation. It correctly identifies the files, but it does not process them individually the way the original code does:

file_names = glob.glob("/path/to/foo/*")
files = map(open,file_names)
documents = [file.read() for file in files]
[file.close() for file in files]

bloblist = [documents]

for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100
        print("\t{}, {}".format(word, round(score_weight, 5)))

How can I maintain the scores for each individual file using glob?

The desired result with the files in a directory as input would be the same as for the original code [results truncated to the top 3 for space]:

Document 1
    Celcius, 3.37888
    30, 3.37888
    hot, 3.37888
Document 2
    there, 2.38509
    out, 2.38509
    headed, 2.38509
Document 3
    on, 3.11896
    this, 3.11896
    many, 3.11896
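
As a sanity check on these figures, the top score for each document can be reproduced by hand. This sketch assumes TextBlob splits the three documents into 12, 17 and 13 words respectively (counts inferred from the scores listed, not verified against TextBlob here) and that each top word occurs once, in one document only.

```python
from __future__ import division

import math

total_documents = 3
documents_containing_word = 1  # the word appears in a single document

# idf as defined in the question: log(N / (1 + n_containing))
idf = math.log(total_documents / (1 + documents_containing_word))

# Assumed word counts per document (inferred from the listed scores).
for number, word_count in enumerate([12, 17, 13], start=1):
    tf = 1 / word_count  # the word occurs once in its document
    print("Document {}: {}".format(number, round(tf * idf * 100, 5)))
```

Under those assumptions the three values come out as 3.37888, 2.38509 and 3.11896, which is why every word unique to a document shares the same score within that document.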

A similar question here did not fully solve the problem. How can I use all the files to calculate the idf, but keep them separate when calculating the full tf-idf?

User answer:

In your first code example you fill bloblist with the results of tb(); in your second example you fill it with the inputs to tb() (plain strings, wrapped in one extra list).

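Try replacing bloblist = [documents] with bloblist = map(tb, documents). On Python 3, use bloblist = list(map(tb, documents)) instead, because the code calls len(bloblist) and iterates over it more than once.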
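
To see the difference concretely, here is a small sketch using plain strings (the sample texts are placeholders, not the asker's files). Wrapping the list in another list gives a one-element list whose only item is the whole list of documents, so the loop runs once and never sees an individual document.

```python
# Placeholder texts standing in for the contents read from the files.
documents = ["first text", "second text", "third text"]

nested = [documents]      # what the question's code builds
flat = list(documents)    # one entry per document

print(len(nested))                # 1: the loop body would run only once
print(type(nested[0]).__name__)   # list: neither a string nor a TextBlob
print(len(flat))                  # 3: one entry per file
```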

You can also sort the list of file names, e.g. file_names = sorted(glob.glob("/path/to/foo/*")), so that the documents are numbered in the same order in both versions and the outputs match.
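
A minimal sketch of the loading step, assuming the directory holds plain-text files encoded as UTF-8. The helper name load_documents is made up for illustration. It sorts the glob results so the document numbering is stable, and it uses with blocks so every file is closed even if reading fails.

```python
import glob
import io
import os


def load_documents(directory):
    """Return (file_name, text) pairs for every file in directory, sorted by path.

    load_documents is a hypothetical helper, not part of glob or TextBlob.
    """
    documents = []
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # io.open accepts an encoding on both Python 2 and Python 3;
        # UTF-8 is an assumption about the input files.
        with io.open(path, encoding="utf-8") as handle:
            documents.append((os.path.basename(path), handle.read()))
    return documents
```

With that in place, the blob list could be built as bloblist = [tb(text) for name, text in load_documents("/path/to/foo")], and the rest of the loop from the question stays unchanged.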

User answer:

I am not sure exactly what you want to achieve. You could create a list and append the results to it:

scores = []
bloblist = [tb(doc) for doc in documents]  # one TextBlob per document
for i, blob in enumerate(bloblist):
    # ... do your evaluation ...
    scores.append(score_weight)

print(scores)
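
If the goal is to keep every file's scores rather than a single value, one option is a dictionary per document. The sketch below is self-contained and makes several assumptions: it uses a simple regular-expression tokenizer in place of TextBlob, so token counts (and therefore some scores) can differ from the TextBlob version; it tests membership per token rather than per substring; and the names tokenize and score_documents are made up for illustration.

```python
from __future__ import division

import math
import re


def tokenize(text):
    # Stand-in for TextBlob's blob.words: runs of letters, digits, apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)


def score_documents(texts):
    """Return one {word: tf-idf} dict per input text, in input order."""
    token_lists = [tokenize(text) for text in texts]
    all_scores = []
    for tokens in token_lists:
        scores = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            n_containing = sum(1 for other in token_lists if word in other)
            # Same smoothed idf as in the question: log(N / (1 + n)).
            idf = math.log(len(token_lists) / (1 + n_containing))
            scores[word] = tf * idf
        all_scores.append(scores)
    return all_scores


texts = [
    "Today, the weather is 30 degrees in Celcius. It is really hot",
    "I can't believe the traffic headed to the beach. It is really a circus out there.",
    "There are so many tolls on this road. I recommend taking the interstate.",
]
for number, scores in enumerate(score_documents(texts), start=1):
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:3]
    print("Document {}: {}".format(number, top))
```

Because the idf is computed over all token lists while each dictionary holds only one document's words, the corpus-wide statistics and the per-file results stay separate.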