Hi everyone, I would like to get your perspective on my case. I have many jobs that I need to run on Apache Beam, and I need to schedule them with Airflow. Is it better to use Dataflow and Cloud Composer, or to buy a VM on Compute Engine, run my jobs there through Python (Apache Beam), and install Airflow on that VM?
Best answer by deok
I’m moving your question to the Infrastructure group.
Of the two options you give us, I would choose the first one. But let's see what others suggest.
@TomNom , welcome to the forum. Great Question!
@ilias already mentioned an option and tagged some well known members of the community that can help.
I’ll reach out to some Customer Engineers at Google so they can add other opinions on your question.
I’ll keep you posted!
@TomNom, I’m a customer engineer with Google Cloud. You will save time and energy using serverless (fully managed) solutions like Dataflow and Composer.
A bit of backstory: in January 2016, Google began fully committing to open source by donating the Dataflow SDK and its computation model to the Apache Software Foundation as Apache Beam. This means Dataflow as a product is an Apache Beam runner, but it goes further by giving you the convenience of not worrying about infrastructure: the service handles provisioning and autoscaling of workers for you, and performance is typically better than a self-managed setup.
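To illustrate the runner point above, here is a minimal sketch of a Beam pipeline: the same pipeline code runs locally with the DirectRunner or on Dataflow just by changing pipeline options. The project, region, and bucket names are placeholders, not real resources.

```python
# Sketch only: a minimal Apache Beam pipeline.
# Requires: pip install apache-beam (apache-beam[gcp] for Dataflow).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # DirectRunner executes locally; swap in DataflowRunner (and uncomment
    # the GCP options) to run the exact same code on Dataflow.
    options = PipelineOptions(
        runner="DirectRunner",
        # runner="DataflowRunner",
        # project="my-project",                 # placeholder
        # region="us-central1",
        # temp_location="gs://my-bucket/tmp",   # placeholder
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Create" >> beam.Create(["alpha", "beta", "gamma"])
            | "Upper" >> beam.Map(str.upper)
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()
```

The key design point is that the pipeline graph itself is runner-agnostic; only the options object decides where it executes.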
The same point applies to Airflow and Cloud Composer. Both of these managed tools let you focus on the value-added part: the data processing.
Follow this doc to learn more about launching Dataflow pipelines with Composer: https://cloud.google.com/composer/docs/how-to/using/using-dataflow-template-operator
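As a sketch of what that doc describes, a Composer environment can run an Airflow DAG that launches a Dataflow template job. This assumes the Google provider package is installed (it ships with Composer); the project ID and output bucket below are placeholders.

```python
# Sketch only: an Airflow DAG that starts a Google-provided Dataflow
# Word Count template. Placeholders: my-project, gs://my-bucket/...
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG(
    dag_id="dataflow_template_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start_template_job = DataflowTemplatedJobStartOperator(
        task_id="start_wordcount",
        project_id="my-project",    # placeholder
        location="us-central1",
        # Google-provided template stored in a public GCS bucket
        template="gs://dataflow-templates/latest/Word_Count",
        parameters={
            "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
            "output": "gs://my-bucket/wordcount/output",  # placeholder
        },
    )
```

Once this file is uploaded to the Composer environment's DAGs bucket, Airflow schedules it daily and the operator submits the job to Dataflow, so no VM management is needed on either side.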
Have you read this article? It will help you choose between the two.