问题描述:

I have a set of 50 million text snippets and I would like to create some clusters out of them. The dimensionality might be somewhere between 60k-100k. The average text snippet length would be 16 words. As you can imagine, the frequency matrix would be pretty sparse. I am looking for a software package / libray / sdk that would allow me to find those clusters. I had tried CLUTO in the past but this seems a very heavy task for CLUTO. From my research online I found that BIRCH is an algorithm that can handle such problems, but, unfortunately, I couldn't find any BIRCH implementation software online (I only found a couple of ad-hoc implementations, like assignment projects, that lacked any sort of documentation whatsoever). Any suggestions?

网友答案:

You may be interested to checkout the Streaming EM-tree algorithm that uses the TopSig representation. Both are these are from my Ph.D. thesis on the topic of large scale document clustering.

We recently clustered 733 million documents on a single 16-core machine (http://ktree.sf.net). It took about 2.5 days to index the documents and 15 hours to cluster them.

The Streaming EM-tree algorithm can be found at https://github.com/cmdevries/LMW-tree. It works with binary document vectors produced by TopSig which can be found at http://topsig.googlecode.com.

I wrote a blog post about a similar approach earlier at http://chris.de-vries.id.au/2013/07/large-scale-document-clustering.html. However, the EM-tree scales better for parallel execution and also produces better quality clusters.

If you have any questions please feel free to contact me at [email protected]

网友答案:

My professor made this implementation of BIRCH Algorithm in Java. It is easy to read with some inline comments.

网友答案:

Try it with the graph partition algorithm. It may help you to make clustering on high dimensional data possible.

网友答案:

I suppose you're rather looking for something like all-pairs search.

This will give you pairs of similar records up to desired threshold. You can use bits of graph theory to extract clusters afterwards - consider each pair an edge. Then extracting connected components will give you something like single-linkage clustering, cliques will give you complete linkage clusters.

I just found implementation of BIRCH in C++.

相关阅读:
Top