CloudSufi

Data Engineer

Gautam Buddha Nagar, India
Python, GCP, Cloud Storage, Cloud SQL, Cloud Run, Dataflow, Pub/Sub, BigQuery, Apigee, Git, SQL, Apache Beam, SPARQL, Schema.org, CI/CD, Cloud Build, Cloud Data Fusion, API, RDF, JSON-LD
Description

SE - Data Engineering

Location: Gautam Buddha Nagar, India

Experience: 4-5 years

Position Type: Full-time
About Us
CLOUDSUFI, a Google Cloud Premier Partner, is a leading global provider of data-driven digital transformation for cloud-based enterprises. With a global presence and a focus on Software & Platforms, Life Sciences and Healthcare, Retail, CPG, Financial Services, and Supply Chain, CLOUDSUFI is positioned to meet customers where they are in their data monetization journey.
Job Summary
We are seeking a highly skilled and motivated Data Engineer to join our Development POD for the Integration Project. The ideal candidate will design, build, and maintain robust data pipelines that ingest, clean, transform, and integrate diverse public datasets into our knowledge graph. This role requires a strong understanding of Google Cloud Platform (GCP) services and data engineering best practices, along with a commitment to data quality and scalability.

Key Responsibilities
ETL Development: Design, develop, and optimize data ingestion, cleaning, and transformation pipelines for various data sources (e.g., CSV, API, XLS, JSON, SDMX) using Google Cloud Platform services (Cloud Run, Dataflow) and Python.
Schema Mapping & Modeling: Work with LLM-based auto-schematization tools to map source data to our schema.org vocabulary, defining appropriate Statistical Variables (SVs) and generating MCF/TMCF files.
Entity Resolution & ID Generation: Implement processes for accurately matching new entities with existing IDs or generating unique, standardized IDs for new entities.
Knowledge Graph Integration: Integrate transformed data into the Knowledge Graph, ensuring proper versioning and adherence to existing standards. 
API Development: Develop and enhance REST and SPARQL APIs via Apigee to enable efficient access to integrated data for internal and external stakeholders.
Data Validation & Quality Assurance: Implement comprehensive data validation and quality checks (statistical, schema, anomaly detection) to ensure data integrity, accuracy, and freshness. Troubleshoot and resolve data import errors.
Automation & Optimization: Collaborate with the Automation POD to leverage and integrate intelligent assets for data identification, profiling, cleaning, schema mapping, and validation, with the aim of significantly reducing manual effort.
Collaboration: Work closely with cross-functional teams, including Managed Service POD, Automation POD, and relevant stakeholders.
Qualifications and Skills
Education: Bachelor's or Master's degree in Computer Science, Data Engineering, Information Technology, or a related quantitative field.
Experience: 3+ years of proven experience as a Data Engineer, with a strong portfolio of successfully implemented data pipelines.
Programming Languages: Proficiency in Python for data manipulation, scripting, and pipeline development.
Cloud Platforms and Tools: Expertise in Google Cloud Platform (GCP) services, including Cloud Storage, Cloud SQL, Cloud Run, Dataflow, Pub/Sub, BigQuery, and Apigee. Proficiency with Git-based version control.
Core Competencies:
Must Have - SQL, Python, BigQuery, GCP Dataflow / Apache Beam, Google Cloud Storage (GCS)
Must Have - Proven ability in comprehensive data wrangling, cleaning, and transforming complex datasets from various formats (e.g., API, CSV, XLS, JSON)
Secondary Skills - SPARQL, Schema.org, Apigee, CI/CD (Cloud Build), GCP, Cloud Data Fusion, Data Modeling
Solid understanding of data modeling, schema design, and knowledge graph concepts (e.g., Schema.org, RDF, SPARQL, JSON-LD).
Experience with data validation techniques and tools.
Familiarity with CI/CD practices and the ability to work in an Agile framework.
Strong problem-solving skills and keen attention to detail.
Preferred Qualifications:

Experience with LLM-based tools or concepts for data automation (e.g., auto-schematization).
Familiarity with similar large-scale public dataset integration initiatives.
Experience with multilingual data integration.