问题描述:

I have a dataset with time interval data. It looks something like this:

date | person | shift_start | shift_end | activity_start | activity_end | activity

10JAN | Joe | 8:00 | 16:00 | 10:00 | 11:00 | training

10JAN | Joe | 8:00 | 16:00 | 13:00 | 14:00 | meeting

11JAN | Joe | 8:00 | 16:00 | 8:00 | 11:00 | dragoning

11JAN | Joe | 8:00 | 16:00 | 13:00 | 14:00 | wizardry

What I want to do is go through the table and "fill in the gaps". For the above data I would want to add the following rows

date | person | shift_start | shift_end | activity_start | activity_end | activity

10JAN | Joe | 8:00 | 16:00 | 8:00 | 10:00 | default

10JAN | Joe | 8:00 | 16:00 | 11:00 | 13:00 | default

10JAN | Joe | 8:00 | 16:00 | 14:00 | 16:00 | default

11JAN | Joe | 8:00 | 16:00 | 11:00 | 13:00 | default

11JAN | Joe | 8:00 | 16:00 | 14:00 | 16:00 | default

As you can see, I need to add possibly many rows per date and person. I'm not sure whether rows can be added within a datastep in this fashion, where processing of the incoming data continues. Moreover, even if this is supported, I am not sure exactly how I can achieve what I want. This is what I was thinking:

data fill_gaps;

retain prev_date

prev_default_activity_end

prev_default_activity_start;

if(date <> prev_date) then do;

/*different shift than previous row */

if(shift_start = activity_start) then do;

/* the newly created activity start time should be the

activity end time of this row, but the new activity end

time cannot be determined without looking at the next row */

default_activity_start = activity_end;

else do;

/* the new activity start and end time can be determined */

default_activity_start = shift_start;

default_activity_end = activity_start;

end;

else do;

/* same shift as previous row */

default_activity_start = prev_default_activity_start;

default_activity_end = activity_start;

end;

prev_date = date;

prev_default_activity_end = default_activity_end;

prev_default_activity_start = default_activity_start;

run;

Then a few more data steps may be needed to extract rows where both default_activity_start and default_activity_end are populated, and append these rows (with the new columns) into the original table.

This seems hacky to me, and I haven't actually had a chance to test it yet (sorry, I know that sounds lazy!). Is there not a more elegant way to do this?

Gratitude!

网友答案:

Here's a solution using lag, assuming your input dataset is called test. This solution fills in the gaps and outputs the original rows, for further info, see the comments:

/* Must sort by person,date to use by-group processing */
proc sort data=test;
  by person date;
run;

data fill_gaps (drop=_:);
  _new_row=0;
  set test;

   /*Hold the previous activity end time*/  
   _laen=lag(activity_end);
   by person date;

/*Conditions such that a new row should be inserted */
if (shift_start < activity_start and first.date) or
_laen < activity_start
or (last.date and activity_end < shift_end) then do;

   /* Output current row */
   output;
   /*Build interim row and output */
   activity_end = activity_start;
   if first.date then activity_start = shift_start;
   else activity_start = _laen;
   activity = 'default';
   _new_row=1;
   output;

    /* If we get to the last date - output the end record */ 
   if last.date and activity_end < shift_end then do;
      activity_start = activity_end;
      activity_end = shift_end;
      output;
    end;
end;
else output;
run;

The output will not be sorted properly, you'll need to sort by person, date, activity_start.

There are interim variables used, all prefixed with underscore. To drop these from the output dataset, remove the comments around drop=_: in the dataset option.

网友答案:

Hash object is the most convenient workshop to manipulate data, forward or backward, based on key search or not. Here is a Hash option. For every data step iteration, it uploads the current row onto Hash, after analyzing and filling the gap, it download the row and output. No sort needed if the data comes in as is.

data have;
    infile cards dlm='|';
    input  (date   person) (:$8.)  (shift_start  shift_end  activity_start  activity_end ) (:time8.) activity :$20.;
    format shift_start  shift_end  activity_start  activity_end :time8.;
    cards;
10JAN | Joe    | 8:00        | 16:00     | 10:00          | 11:00        | training
10JAN | Joe    | 8:00        | 16:00     | 13:00          | 14:00        | meeting
11JAN | Joe    | 8:00        | 16:00     | 8:00           | 11:00        | dragoning
11JAN | Joe    | 8:00        | 16:00     | 13:00          | 14:00        | wizardry
;

data want;
    if _n_=1 then
        do;
            dcl hash h();
            h.definekey('person');
            h.definedata('date','person', 'shift_start', 'shift_end',  'activity_start',  'activity_end', 'activity');
            h.definedone();
        end;

    set have;
    by person date notsorted;
    rc=h.add();
    lag_end=lag(activity_end);

    if first.date and shift_start < activity_start then
        do;
            activity_end=activity_start;
            activity_start=shift_start;
            activity='default';
            output;
        end;
    else if lag_end < activity_start then
        do;
            activity_end=activity_start;
            activity_start=lag_end;
            activity='default';
            output;
        end;

    rc=h.find();
    output;
    rc=h.clear();

    if last.date and activity_end < shift_end then
        do;
            activity_start=activity_end;
            activity_end=shift_end;
            activity='default';
            output;
        end;

    drop rc lag_end;
run;
相关阅读:
Top