Question:

I would like to schedule and distribute the execution of R scripts across several machines, either Windows or Ubuntu (each task runs on only one machine), using RServe for instance.

I don't want to reinvent the wheel and would like to use an existing system to distribute these tasks in an optimal manner, ideally with a GUI to check that the scripts run properly.

1/ Is there an R package or library that can be used for this?

2/ One framework that seems to be quite widely used is MapReduce with Apache Hadoop.

I have no experience with this framework. What installation/plugin/setup would you advise for my purpose?

Edit: Here are more details about my setup:

I do indeed have an office full of machines (small servers or workstations) that are sometimes also used for other purposes. I want to use the computing power of all these machines and distribute my R scripts across them.

I also need a scheduler, i.e. a tool to run the scripts at a fixed time or at regular intervals.

I am using both Windows and Ubuntu, but a good solution on just one of the two systems would be sufficient for now.

Finally, I don't need the server to collect the results of the scripts. The scripts do things like accessing a database and saving files, but do not return anything; I would just like to get back any errors or warnings.

Answer:

If what you are wanting to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF to get more details. The gist is as follows:

Why write a doRedis package? After all, the foreach package already has available many parallel back end packages, including doMC, doSNOW and doMPI. The doRedis package allows for dynamic pools of workers. New workers may be added at any time, even in the middle of running computations. This feature is relevant, for example, to modern cloud computing environments. Users can make an economic decision to "turn on" more computing resources at any time in order to accelerate running computations. Similarly, modern cluster resource allocation systems can dynamically schedule R workers as cluster resources become available.
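As a rough sketch of how this looks in practice (assuming a Redis server is reachable at a placeholder host "redis-host", that the foreach and doRedis packages are installed on every machine, and using an example queue name "jobs"): the master registers a work queue and submits tasks with foreach, and each Windows or Ubuntu worker simply attaches to that queue.

    ## On the master: register a doRedis work queue and submit tasks.
    ## "redis-host" and the queue name "jobs" are placeholders.
    library(foreach)
    library(doRedis)

    registerDoRedis("jobs", host = "redis-host")
    startLocalWorkers(n = 2, queue = "jobs", host = "redis-host")  # optional workers on the master itself

    results <- foreach(i = 1:10, .combine = c) %dopar% {
      sqrt(i)  # stand-in for sourcing one of your R scripts
    }

    removeQueue("jobs")  # clean up the queue when done

    ## On each additional worker machine (Windows or Ubuntu),
    ## started by hand or by a scheduler:
    library(doRedis)
    redisWorker(queue = "jobs", host = "redis-host")

Workers can be added or removed at any point; the queue simply drains faster or slower as machines come and go.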

Hadoop works best if the machines running it are dedicated to the cluster rather than borrowed. There is also considerable overhead in setting up Hadoop, which can be worth the effort if you need the map/reduce algorithm and the distributed storage that Hadoop provides.

So what, exactly, is your configuration? Do you have an office full of machines you want to distribute R jobs across? Do you have a dedicated cluster? Is this going to be EC2 or another "cloud"-based setup?

The devil is in the details, so you can get better answers if the details are explicit.

If you want the workers to do jobs and have their results collected back on one master node, you'll be much better off using a dedicated R solution rather than a system like TakTuk or dsh, which are more general parallelization tools.
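Since, per your edit, you mainly need to know about failures rather than collect results, note that the foreach loop itself can carry that back: foreach has an .errorhandling argument, and with .errorhandling = "pass" an error raised by a task is returned to the master as a condition object instead of aborting the whole loop. A minimal sketch, assuming a backend such as doRedis is already registered and using hypothetical script paths:

    library(foreach)

    scripts <- c("update_db.R", "export_files.R")  # hypothetical script paths
    status <- foreach(s = scripts, .errorhandling = "pass") %dopar% {
      source(s)             # the script does its own side effects (database, files, ...)
      paste(s, "finished")  # small marker returned on success
    }

    ## Elements of 'status' that inherit from "condition" are captured errors.
    failed <- Filter(function(x) inherits(x, "condition"), status)

This keeps the heavy output on the workers (database, files) while still giving the master a record of which scripts failed and why.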

Answer:

Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may require more effort.
