Problem description:

I am trying to aggregate data from per-second to per-minute resolution (taking the average of all values over each minute). My data is timestamped in the form:

date time[h:m:s-03:00],X,value

2016-05-29 15:50:33-03:00,X,561
2016-05-29 15:50:34-03:00,X,565
....

I have hacked my way to an answer with the following code, but I would like to find a better, cleaner way to do this:

awk -F: 'BEGIN{
    foo_total = 0;
    foo = 0;
}
{
    split($4, ar, ",");
    foo = ar[3];
    split($3, a, "-");
    if (a[1] ~ "00") {
        first_foo = ar[3]
    }
    if (a[1] !~ "00") {
        foo_total += ar[3]
    }
    foo_sum = (first_foo + foo_total);
    foo_final = foo_sum / 60
    if (a[1] ~ "59") {
        printf $1 ":" $2 "," foo_final "\n";
        foo_total = 0;
        first_foo = 0;
    }
}' MyFile.csv

Your help is much appreciated!

Answer:

A perl-based approach:

$ cat testdata
2016-05-29 15:50:33-03:00,X,561
2016-05-29 15:50:34-03:00,X,565

$ perl -E 'my %arr; my %len;
           while(<STDIN>){
               m/^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}-03:00,X,([0-9]*)/;
               $arr{$1} += $2; $len{$1}++;
           }
           foreach my $key (sort keys %arr) {
               print "$key-03:00,X,". ($arr{$key}/$len{$key}) . "\n";
           }' < testdata 

2016-05-29 15:50-03:00,X,563

Explanation:

  1. For each line, extract the minute-resolution timestamp and the value with the regex (the m/.../ line).
  2. Accumulate the running total and the sample count in the hashes %arr and %len respectively.
  3. At the end, print the average for each minute.

You can probably skip the length part of the logic and hard-code the divisor to 60. But IMO, recording the actual number of entries per minute is safer.
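The same accumulate-then-average idea also translates directly to awk. A minimal sketch, assuming the timestamp is always exactly of the "YYYY-MM-DD HH:MM:SS-03:00" form shown in testdata, so that its first 16 characters are the minute-level grouping key:

$ awk -F, '{
      key = substr($1, 1, 16)   # "YYYY-MM-DD HH:MM" minute prefix
      sum[key] += $3            # running total for that minute
      cnt[key]++                # actual number of samples seen
  }
  END {
      for (k in sum)
          print k "-03:00,X," sum[k] / cnt[k]
  }' testdata
2016-05-29 15:50-03:00,X,563

Note that for (k in sum) does not guarantee any particular output order in awk, so pipe the output through sort if the minutes must come out chronologically.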

Answer:

Please look at the following awk code:

BEGIN { total = 0; DayHour = "Empty"; Min = "" }
{
    # Same minute as the previous line: keep accumulating.
    if ((DayHour == $1) && (Min == $2))
    {
        split($4, a, ",");
        total += a[3];
    }
    else
    {
        # A new minute started: print the finished one (unless this
        # is the very first line), then start a new total.
        if (DayHour != "Empty")
            print DayHour ":" Min, total / 60;
        DayHour = $1;
        Min = $2;
        split($4, a, ",");
        total = a[3];
    }
}
END { print DayHour ":" Min, total / 60; }

You run it as: awk -F: -f program.awk <data_file>
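For example, saving the script as program.awk and running it on the two-line testdata from the first answer (note the hard-coded division by 60: a partial minute like this sample produces a correspondingly low average):

$ awk -F: -f program.awk testdata
2016-05-29 15:50 18.7667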
