Problem description:

I have a CSV file and I want to create a histogram from column 6. Using Linux utilities this is simple:

`└──> cut -f6 -d, data.csv | sort | uniq -c | sort -k2,2n`

```
563 0.0
72 0.025
35 0.05
22 0.075
14 0.1
21 0.125
14 0.15
10 0.175
5 0.2
3 0.225
7 0.25
3 0.275
6 0.3
5 0.325
3 0.35
1 0.375
3 0.4
1 0.425
3 0.45
3 0.475
5 0.5
7 0.525
11 0.55
3 0.575
4 0.6
3 0.625
11 0.65
5 0.675
9 0.7
5 0.725
7 0.75
8 0.775
5 0.8
3 0.825
3 0.85
4 0.875
2 0.9
1 0.925
1 0.975
109 1.0
```

But I would like to plot it using `gnuplot`.

My attempt was to modify the following script that I found. This is my modified version:

```
#!/usr/bin/gnuplot -p
# http://psy.swansea.ac.uk/staff/carter/gnuplot/gnuplot_frequency.htm
clear
reset
set datafile separator ",";
# set term dumb
set key off
set border 3
# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
set yzeroaxis
# Each bar is half the (visual) width of its x-range.
set boxwidth 0.05 absolute
set style fill solid 1.0 noborder
bin_width = 0.1;
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )
# MAKE BINS
# plot dataset_path using (rounded($6)):(6) smooth frequency with boxes
# DO NOT MAKE BINS
plot "data.csv" using 6:6 smooth frequency with boxes
```

This is the result:

http://oi57.tinypic.com/x1acrm.jpg

It shows something completely different from the Unix tools. In `gnuplot` I've seen various types of histograms: some follow a normal distribution pattern, others are ordered by frequency (as if I replaced the last `sort -k2,2n` with `sort -n`), and others are ordered by the values from which the histogram was created (my case). It would be nice if I could choose.

`smooth frequency` renders the data monotonic in x (i.e. the value given in the first `using` column, in your case the numerical value from column 6), and then sums up all y-values (the values given in the second `using` column) that share the same x-value.

Here you also give the sixth column as the y-value, which is wrong if you want to count the number of occurrences of each distinct value in the sixth column. Use `using 6:(1)`, i.e. the numerical value `1` in the second column, to count the actual number of occurrences of each value:

```
set style fill solid noborder
set boxwidth 0.8 relative
set datafile separator ','
plot 'nupic_out.csv' using 6:(1) smooth frequency with boxes notitle
```
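To make the difference between `using 6:6` and `using 6:(1)` concrete, here is a minimal Python sketch of the behaviour described above: group the points by x and sum their y-values. The function name `smooth_frequency` and the sample values are made up for illustration; this only models the idea and is not gnuplot's implementation.

```python
from collections import defaultdict

def smooth_frequency(points):
    """Sum the y-values of all points sharing the same x, sorted by x."""
    totals = defaultdict(float)
    for x, y in points:
        totals[x] += y
    return sorted(totals.items())

# Hypothetical sample of column-6 values (not taken from the real file).
values = [0.0, 0.0, 0.0, 0.5, 0.5, 1.0]

# Like `using 6:6`: the values themselves are summed, so every
# occurrence of 0.0 contributes 0 to the bar height.
summed = smooth_frequency((v, v) for v in values)

# Like `using 6:(1)`: each row contributes 1, which counts occurrences.
counted = smooth_frequency((v, 1) for v in values)

print(summed)
print(counted)
```

In this model the bar at 0.0 has height 0 in the first case no matter how often 0.0 occurs, which would explain why the original plot disagrees with the `uniq -c` counts.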

To apply a logscale to the smoothed data, you must first save them to a temporary file with `set table ...; plot` and then plot this temporary file.

```
set datafile separator ','
set table 'tmp.dat'
plot 'nupic_out.csv' using 6:(1) smooth frequency with lines
unset table
```

Pay attention here, because a bug in gnuplot adds a wrong last line to the output file, which you must skip. You can either skip it with a filter in the `using` statement, e.g.

```
plot 'tmp.dat' using (strcol(3) eq "i" ? $1 : 1/0):2 with boxes
```

which works fine here, or you could use `head` to cut the last two lines, like

```
plot '< head -n-2 tmp.dat' using 1:2 with boxes
```

Another point to note is that gnuplot always uses white space to write out its data files, so you must change the data file separator back to `whitespace` before plotting `tmp.dat`.
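The `strcol(3) eq "i"` filter shown earlier keeps only rows whose third column is `i`. As a rough illustration of the same idea outside gnuplot, the Python sketch below splits each line on white space and keeps only those rows. The helper name `read_table_rows` and the sample lines are hypothetical: they merely mimic the three-column layout implied by that filter and are not copied from a real `tmp.dat`.

```python
def read_table_rows(lines):
    """Return (x, y) pairs from rows whose third column is the flag "i"."""
    rows = []
    for line in lines:
        parts = line.split()  # whitespace-separated, as in the table output
        # Skip blank lines, '#' comment lines, and rows without the "i" flag.
        if len(parts) < 3 or parts[0].startswith("#") or parts[2] != "i":
            continue
        rows.append((float(parts[0]), float(parts[1])))
    return rows

# Made-up sample content; the exact layout gnuplot writes is an assumption.
sample = [
    "# some comment line",
    " 0  563  i",
    " 0.5  5  i",
    " 1  109  i",
    " 1  109  u",  # stands in for the spurious trailing line
    "",
]

rows = read_table_rows(sample)
print(rows)
```

Filtering on the flag does not depend on how many trailing lines are wrong, whereas `head -n-2` assumes exactly two lines need to be dropped.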

A full working script could be

```
set style fill solid noborder
set boxwidth 0.8 relative
set datafile separator ','
set table 'tmp.dat'
plot 'nupic_out.csv' using 6:(1) smooth frequency with lines notitle
unset table
set datafile separator whitespace
set logscale y
set yrange [0.8:*]
set autoscale xfix
plot '< head -n-2 tmp.dat' using 1:2 with boxes notitle
```

Now, to use a binning function for the values in the sixth column, you must replace the `6` in `using 6:(1)` by a function which operates on the value given in the sixth column. This function must be enclosed in `()`, and you reference the current value in the sixth column using `$6` inside the function, like

```
plot 'nupic_out.csv' using (bin($6)):(1) smooth frequency with lines
```

Again, a full working script, using ChrisW's binning function, could be

```
set style fill solid noborder
set datafile separator ','
set boxwidth 0.09 absolute
Min = -0.05
Max = 1.05
n = 11.0
width = (Max-Min)/n
bin(x) = width*(floor((x-Min)/width)+0.5) + Min
set table 'tmp.dat'
plot 'nupic_out.csv' using (bin($6)):(1) smooth frequency with lines notitle
unset table
set datafile separator whitespace
set logscale y
set xrange [-0.05:1.05]
set tics nomirror out
plot '< head -n-2 tmp.dat' using 1:2 with boxes notitle
```
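The binning function in the script above can be checked by hand. With `Min = -0.05`, `Max = 1.05` and `n = 11`, the width is 0.1 and the bin centres fall on 0.0, 0.1, ..., 1.0. The following Python transcription (the name `bin_center` is chosen here for illustration) evaluates the same formula for a few sample values that lie well inside their bins.

```python
import math

# Same constants as in the gnuplot script above.
Min = -0.05
Max = 1.05
n = 11.0
width = (Max - Min) / n

def bin_center(x):
    """Transcription of the gnuplot bin(x): centre of the bin containing x."""
    return width * (math.floor((x - Min) / width) + 0.5) + Min

# Values chosen away from bin edges to avoid floating-point edge effects.
for x in (0.0, 0.025, 0.075, 0.5, 1.0):
    print(x, round(bin_center(x), 3))
```

Up to floating-point rounding, 0.0 and 0.025 both land in the bin centred on 0.0, 0.075 lands in the bin centred on 0.1, and 0.5 and 1.0 sit at the centres of their own bins. Values lying exactly on a bin edge (such as 0.05) may fall on either side depending on rounding.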