I want to normalize the values in one column of a pandas dataframe based on the value in another column. It's not a pure normalization in a statistical sense. The second value is a type; I want to sum all the first values for each type, then, in each row, divide the value by the total for the type of that row. An example should make this clearer.

```
df = pd.read_table(datafile, names=["A", "B", "value", "type"])

    A   B  value   type
0  A1  B1      1  type1
1  A2  B2      1  type1
2  A1  B1      1  type2
3  A1  B3      1  type3
4  A2  B2      1  type2
5  A2  B4      1  type3
6  A3  B4      1  type2
7  A3  B5      1  type3
8  A4  B6      1  type2
9  A4  B7      1  type3
```

Then I can find the sums with something like:

```
types = df.groupby(["type"])["value"].sum()

type
type1    2
type2    4
type3    4
Name: value, dtype: int64
```

Then how do I use this to normalize the value in each row?

I can use a loop like this to calculate the normalized values:

```
norms = []
for ix, row in df.iterrows():
    norms.append(row["value"] / types[row["type"]])
```

And then replace the column with a new one with these values:

```
df["value"] = pd.Series(norms)

    A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3
```

But as I understand it, looping like this is inefficient and not idiomatic pandas, and there is likely a way to do it using standard pandas functions.

Thanks.

You could use `transform`, which performs an operation on each group and then broadcasts the result back up to match the original index. For example:

```
>>> df["value"] /= df.groupby("type")["value"].transform(sum)
>>> df
A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3
```

because we have

```
>>> df.groupby("type")["value"].transform(sum)
0    2
1    2
2    4
3    4
4    4
5    4
6    4
7    4
8    4
9    4
dtype: int64
```
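Putting the pieces together, here is a minimal self-contained sketch of the `transform` approach; the example data is reconstructed inline rather than read from `datafile`, and `transform("sum")` (the string form) is equivalent to passing the builtin `sum`:

```python
import pandas as pd

# Reconstruct the example data from the question inline
# (a stand-in for pd.read_table(datafile, ...))
df = pd.DataFrame({
    "A": ["A1", "A2", "A1", "A1", "A2", "A2", "A3", "A3", "A4", "A4"],
    "B": ["B1", "B2", "B1", "B3", "B2", "B4", "B4", "B5", "B6", "B7"],
    "value": [1] * 10,
    "type": ["type1", "type1", "type2", "type3", "type2",
             "type3", "type2", "type3", "type2", "type3"],
})

# transform("sum") returns a Series aligned to df's original index,
# so the division happens elementwise, row by row
df["value"] = df["value"] / df.groupby("type")["value"].transform("sum")
print(df)
```

Because the transformed Series shares `df`'s index, no merging or looping is needed; the per-group totals are already lined up with the rows they belong to.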

I think the best way to achieve this is by using the `.apply()` method on your groupby object:

```
# Using backslashes for explicit line continuation, not seen
#   that often in Python but useful in pandas when you're
#   chaining a lot of methods one after the other
df['value_normed'] = df.groupby('type', group_keys=False)\
.apply(lambda g: g['value'] / g['value'].sum())
df
Out[9]:
A   B  value   type  value_normed
0  A1  B1      1  type1          0.50
1  A2  B2      1  type1          0.50
2  A1  B1      1  type2          0.25
3  A1  B3      1  type3          0.25
4  A2  B2      1  type2          0.25
5  A2  B4      1  type3          0.25
6  A3  B4      1  type2          0.25
7  A3  B5      1  type3          0.25
8  A4  B6      1  type2          0.25
9  A4  B7      1  type3          0.25
```

You need the `group_keys=False` argument so that `type` doesn't become the index for each group's data, which prevents you from matching the transformed values back up to your original dataframe easily.
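A small sketch of the difference, using constructed data (not the question's dataframe); note that the `group_keys=True` behavior described in the comments applies to recent pandas versions (1.5+ when the argument is passed explicitly):

```python
import pandas as pd

df = pd.DataFrame({
    "value": [1, 1, 1, 1],
    "type": ["type1", "type1", "type2", "type2"],
})

# With group_keys=True, the group label is prepended to the result's
# index, producing a MultiIndex like ("type1", 0), ("type1", 1), ...
# which no longer aligns with df.index
with_keys = df.groupby("type", group_keys=True).apply(
    lambda g: g["value"] / g["value"].sum()
)

# With group_keys=False, the result keeps the original integer index,
# so it can be assigned straight back as a new column
without_keys = df.groupby("type", group_keys=False).apply(
    lambda g: g["value"] / g["value"].sum()
)
print(with_keys.index)
print(without_keys.index)
df["value_normed"] = without_keys
```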
