The rapid growth of data has increased the demand for professionals who can draw valuable insights from it. Data engineer is one of the most popular positions in companies and is crucial to the analytics team. The role is often confused with that of a data analyst and similar positions, but a data engineer is usually responsible for building the infrastructure or framework needed to generate data. They work on the architecture side of data: collection, storage, and management, among others.
Although every company may have its own definition of what a data engineer does, the hiring process remains largely the same, and so do the interview questions. If you are applying for a data engineer role, these are the questions you are most likely to be asked:
General Questions
What are the different types of design schemas in data modeling?
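- There are two main schemas in data modeling: the star schema and the snowflake schema. In a star schema a central fact table links directly to denormalized dimension tables; in a snowflake schema those dimension tables are further normalized into sub-dimensions.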
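As a rough illustration of a star schema, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical join-and-aggregate query. The table and column names (fact_sales, dim_product, dim_store) and the sample rows are made up for this example. In a snowflake schema, a dimension such as dim_product would itself be split further, for instance into a separate category table.

```python
import sqlite3

# Hypothetical star schema: one fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'laptop', 'electronics'), (2, 'desk', 'furniture');
    INSERT INTO dim_store   VALUES (1, 'Pune'), (2, 'Delhi');
    INSERT INTO fact_sales  VALUES (1, 1, 1, 900.0), (2, 2, 1, 150.0), (3, 1, 2, 950.0);
""")

# Typical star-schema query: join the fact table to a dimension and aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(revenue_by_category)  # [('electronics', 1850.0), ('furniture', 150.0)]
```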
How is the Hadoop database different from a traditional Relational Database Management System (RDBMS)?
- The Hadoop database (HBase) is a column-oriented database with a flexible schema, so columns can be added on the fly. It handles sparse tables well, integrates tightly with MR (MapReduce), and scales horizontally, which makes it very efficient for semi-structured and unstructured data.
- An RDBMS is row-oriented with a fixed schema. It is optimized for joins rather than for sparse tables, and it has no built-in integration with MapReduce, which is another major difference from Hadoop. An RDBMS is preferred for structured data.
Elaborate on the Hadoop Distributed File System (HDFS)
- Hadoop can work directly with any scalable distributed file system such as Local FS, HFTP FS, S3 FS, and others, but the file system it most commonly uses is HDFS.
- HDFS is modeled on the Google File System (GFS) and provides a distributed file system designed to run reliably and fault-tolerantly on large clusters (thousands of computers) of small commodity machines.
- HDFS uses a master/slave architecture: the master is a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
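To make the NameNode/DataNode split more concrete, here is a toy model in plain Python. It is only a sketch under simplifying assumptions: the class ToyNameNode, the tiny block size, and the round-robin placement are invented for illustration, and replication and failure handling are left out entirely. It does not use or imitate the real HDFS API.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # bytes; tiny on purpose (real HDFS blocks are far larger)


class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = defaultdict(list)  # filename -> [(block_id, node_index)]

    def write(self, filename, data):
        # Split the data into fixed-size blocks and spread them over DataNodes.
        for offset in range(0, len(data), BLOCK_SIZE):
            block_number = offset // BLOCK_SIZE
            block_id = f"{filename}#{block_number}"
            node_index = block_number % len(self.datanodes)  # round-robin placement
            self.datanodes[node_index][block_id] = data[offset:offset + BLOCK_SIZE]
            self.block_map[filename].append((block_id, node_index))

    def read(self, filename):
        # Look up block locations in the metadata, then fetch from the DataNodes.
        return b"".join(
            self.datanodes[node_index][block_id]
            for block_id, node_index in self.block_map[filename]
        )


datanodes = [{}, {}, {}]  # each dict stands in for one DataNode's block storage
namenode = ToyNameNode(datanodes)
namenode.write("log.txt", b"hello hdfs!")

print(namenode.block_map["log.txt"])  # metadata only, no file contents
print(namenode.read("log.txt"))       # b'hello hdfs!'
```

The point of the sketch is that the NameNode never stores file contents; it only knows where each block lives.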
How can data analytics and big data boost business revenue?
- Using data efficiently to ensure business growth
- Maximizing customer value
- Cutting down the company's production costs
- Turning to analytics to improve staffing-level forecasts
Technical Questions: Get Set Code
Data science involves in-depth coding, which requires knowledge of programming languages such as Python and Java, statistical software such as R, database systems like Hadoop, ETL testing tools, and task automation platforms like PowerShell. Here are a few questions asked on these topics.
Python
Name a few well-known Python packages
- Pandas: a package that provides flexible data structures for working with relational or labeled data.
- NumPy: a package for working with numerical data structures such as arrays.
- Matplotlib: a 2D plotting library written especially for Python.
- TensorFlow: a package used for building computational graphs.
What are Lambda functions?
Lambda functions are functions without a name (anonymous functions). They are defined inline with the lambda keyword and can be assigned to a variable or passed to another function. The example below illustrates this.
g = lambda z: z * 2
a = g(5)
print(a)
# Output: 10 (5 * 2)
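In practice, lambdas are most often passed straight to another function rather than assigned to a name. A minimal sketch, with made-up sample data:

```python
# Sort (name, score) pairs by score, highest first, using a lambda as the key.
scores = [("asha", 72), ("ravi", 91), ("meena", 85)]
by_score = sorted(scores, key=lambda pair: pair[1], reverse=True)
print(by_score)  # [('ravi', 91), ('meena', 85), ('asha', 72)]

# Apply a lambda to every element of a list with map().
doubled = list(map(lambda z: z * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```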
What is meant by *args and **kwargs?
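*args lets a function accept any number of positional arguments, which Python collects into a tuple; **kwargs lets it accept any number of keyword arguments, collected into a dictionary. The distinction rests on the difference between keyword and positional arguments, which the examples below illustrate.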
def total_cost(number=1, price_per_unit=1):
    return number * price_per_unit

total_cost(number=10, price_per_unit=12)
total_cost(price_per_unit=12, number=10)
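Here number and price_per_unit are passed as keyword arguments: they are optional (both have defaults) and can be given in any order. These are the kind of arguments that **kwargs collects.
Arguments that are matched by position, and so cannot be reordered, are positional arguments, the kind that *args collects. Here is an example of a positional argument.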
def square_area(side):
    return side * side

print(square_area(5))
# Output: 25
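The functions above take a fixed set of parameters. As a minimal sketch of the *args and **kwargs syntax itself (the function describe_order and its arguments are made up for illustration), the following shows how Python packs extra positional arguments into a tuple and extra keyword arguments into a dictionary, and how * and ** unpack them again at the call site.

```python
def describe_order(*args, **kwargs):
    """Return the positional and keyword arguments exactly as Python packs them."""
    return args, kwargs


positional, keyword = describe_order(10, 12, currency="INR", express=True)
print(positional)  # (10, 12)
print(keyword)     # {'currency': 'INR', 'express': True}


def total_cost(number=1, price_per_unit=1):
    return number * price_per_unit


# The reverse direction: * unpacks a sequence into positional arguments,
# ** unpacks a dictionary into keyword arguments.
from_list = total_cost(*[10, 12])
from_dict = total_cost(**{"price_per_unit": 12, "number": 10})
print(from_list, from_dict)  # 120 120
```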
What is the difference between lists and tuples? Give examples.
- Lists are mutable, that is, they can be edited. For example: list_1 = ['naren', 123, 'india']
- Tuples are immutable (they are like lists that can't be edited). For example: tuple_1 = ('india', 100, 'virat')
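A short sketch of what mutable and immutable mean in practice. The variable names and values are illustrative only.

```python
list_1 = ["naren", 123, "india"]
list_1[1] = 456        # allowed: lists can be changed in place
list_1.append("pune")  # allowed: lists can grow

tuple_1 = ("india", 100, "virat")
try:
    tuple_1[1] = 200   # not allowed: tuples cannot be changed
    tuple_error = None
except TypeError as exc:
    tuple_error = exc

print(list_1)                      # ['naren', 456, 'india', 'pune']
print(type(tuple_error).__name__)  # TypeError

# Because they are immutable, tuples of hashable items can be dictionary keys.
visits = {("india", "pune"): 3}
print(visits[("india", "pune")])   # 3
```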
R Programming
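How can a .csv file be loaded in R?
- Use the read.csv() function, for example: data <- read.csv("path/to/file.csv")
How do you install a package in R?
- Use the install.packages() function, for example: install.packages("data.table")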
Mention some widely used packages for data mining in R.
- data.table: provides fast reading and manipulation of large files.
- rpart and caret: used for building machine learning models.
- arules: used for association rule learning.
- ggplot2: provides a wide range of data visualization plots.
- tm: helps in performing text mining.
- forecast: implements functions for time series analysis.
Hadoop Database
What are the main methods of a Reducer?
- setup(): used for configuring parameters such as input data size and the distributed cache; called once at the start of the task.
public void setup(Context context)
- reduce(): the heart of the reducer, called once per key with the values associated with that key.
public void reduce(Key key, Iterable<Value> values, Context context)
- cleanup(): called only once, at the end of the task, to clean up temporary files.
public void cleanup(Context context)
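The call order of these three methods can be mimicked in a few lines of Python. This is only a toy simulation of the lifecycle, not Hadoop code: the class WordCountReducer, the driver run_reducer, and the sample pairs are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


class WordCountReducer:
    def setup(self):
        # Called once before any key is processed.
        self.calls = []
        self.totals = {}

    def reduce(self, key, values):
        # Called once per key with all values for that key.
        self.calls.append(key)
        self.totals[key] = sum(values)

    def cleanup(self):
        # Called once after the last key.
        self.calls.append("cleanup")


def run_reducer(reducer, mapped_pairs):
    """Stand-in for the framework: sort and group by key, then drive the reducer."""
    reducer.setup()
    ordered = sorted(mapped_pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        reducer.reduce(key, (value for _, value in group))
    reducer.cleanup()
    return reducer.totals


pairs = [("data", 1), ("hadoop", 1), ("data", 1), ("data", 1)]
reducer = WordCountReducer()
totals = run_reducer(reducer, pairs)

print(totals)         # {'data': 3, 'hadoop': 1}
print(reducer.calls)  # ['data', 'hadoop', 'cleanup']
```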
Mention the various schedulers in a Hadoop framework.
- COSHH (a classification and optimization based scheduler for heterogeneous Hadoop systems): a scheduler that takes heterogeneity into account at both the application and cluster levels.
- FIFO Scheduler: the JobTracker picks jobs from a work queue, oldest job first.
- Fair Scheduler: the goal of fair-share scheduling is to assign resources to jobs such that, on average over time, each job gets an equal share of the available resources.
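The difference between FIFO and fair sharing shows up in the order in which jobs finish. The sketch below is a deliberately simplified model, not Hadoop's actual scheduler logic: it assumes a single resource slot, measures each job in abstract work units, and approximates fair sharing with round-robin turns. The job names and sizes are invented.

```python
from collections import deque


def fifo_finish_order(jobs):
    """Run each job to completion in arrival order (oldest job first)."""
    queue = deque(jobs)
    finished = []
    while queue:
        name, _work_units = queue.popleft()
        finished.append(name)
    return finished


def fair_finish_order(jobs):
    """Give every unfinished job one work unit per round (equal share over time)."""
    remaining = dict(jobs)
    finished = []
    while remaining:
        for name in list(remaining):
            remaining[name] -= 1
            if remaining[name] == 0:
                finished.append(name)
                del remaining[name]
    return finished


jobs = [("big_etl", 5), ("small_report", 1), ("medium_join", 3)]

print(fifo_finish_order(jobs))  # ['big_etl', 'small_report', 'medium_join']
print(fair_finish_order(jobs))  # ['small_report', 'medium_join', 'big_etl']
```

Under FIFO the small job waits behind the large one; under fair sharing it finishes first because every job makes progress in each round.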
Microsoft PowerShell
Explain the importance of brackets in PowerShell.
- Parentheses (): curved brackets are used for mandatory arguments.
- Braces {}: curly brackets are used for block statements.
- Square brackets []: they define optional items, and they are not used frequently.
Mention the three ways that PowerShell uses to ‘Select’.
- The most familiar and widely used way is Get-WmiObject, where the -Query parameter introduces a classic ‘Select * from’ phrase (a WQL query).
- The second widely used method is Select-String, which checks for a word, phrase, or any pattern match.
- The third way is Select-Object, which selects specific properties of an object.
Conclusion
Landing a data engineer post is tough but not impossible. With the many complications involved in collecting and managing data, this field now hosts a wide array of jobs and designations. The ability to combine knowledge, skill, and an analytical approach is essential. It’s not just about data science; it’s about being able to transform that data into visualizations. Your strategy will only be as good as your data, so take the time to build the skills required to be a data engineer whom employers will want to hire.