问题描述:

I read the paper k-means++: The Advantages of Careful Seeding and didn't quite understand the algorithm provided which is:

"Let D(x) denote the shortest distance from a data point x to the closest center we have already chosen.

1a. Choose an initial center c1 uniformly at random

from X.

1b. Choose the next center ci, selecting ci = x' ∈ X with probability (D(x')^2) / Sum_of(D(x)^2)

1c. Repeat Step 1b until we have chosen a total of k centers.

2-4. Proceed as with the standard k-means algorithm"

(Better look at the algorithm in the link above)

Especially step 1b. What do they mean by "selecting ci = x' ∈ X with probability (D(x')^2) / Sumof(D(x)^2)". Do they mean selecting the element that has the largest proportion ? And how can perform such computation can lead to select the optimal centroids ?

The function **D(x)** is defined for all points x ∈ X.

In step 1b, we must choose some point to be a new center. We will choose randomly between all points (that are not already centers). But we will not give an equal chance to every point; we will assign different probabilities to different points before we choose. These probabilities must add up to 1.

Consider D(x)^2. We can evaluate this at every point, and add up the values: Sum_of(D(x)^2).

Then we can assign each point x' a probability equal to D(x')^2 / Sum_of(D(x)^2). These probabilities add up to 1, and give a better chance to points far away from all existing centers.

http://en.wikipedia.org/wiki/K-means%2B%2B

```
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
```