Python & HDFS
Read and write data from HDFS using Python
Introduction
Python has a variety of modules which can be used to deal with data, especially when we have to read data from or write data to HDFS.
In this article we will work with two types of flat files: the CSV and Parquet formats.
1. Prerequisite
Note that it is necessary to have the Hadoop client and the libhdfs.so library installed on your machine.
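The exact installation paths depend on your Hadoop distribution; the sketch below uses placeholder paths (adapt them to your machine) to point PyArrow at libhdfs.so through the ARROW_LIBHDFS_DIR variable and to check that the Hadoop client is reachable before connecting.
import os
import shutil

# Placeholder paths: adjust them to where your Hadoop client is installed
os.environ.setdefault('HADOOP_HOME', '/usr/hdp/current/hadoop-client')
os.environ.setdefault('ARROW_LIBHDFS_DIR', '/usr/hdp/current/hadoop-client/lib/native')

# Basic sanity checks before trying to connect to HDFS
if shutil.which('hadoop') is None:
    raise RuntimeError('Hadoop client not found on PATH')
if not os.path.isfile(os.path.join(os.environ['ARROW_LIBHDFS_DIR'], 'libhdfs.so')):
    raise RuntimeError('libhdfs.so not found; set ARROW_LIBHDFS_DIR accordingly')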
2. CSV Format
One of the most popular modules for reading CSV files from a Kerberized HDFS cluster is the hdfs module.
2.1 Read a CSV file from HDFS
After instantiating the HDFS client, invoke the read_csv() function of the Pandas module to load the CSV file.
from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# WebHDFS endpoint of the NameNode; the second argument is the mutual authentication mode
hdfs_client = KerberosClient('http://ip_hadoop_master', 'OPTIONAL')

# Stream the file from HDFS and load it into a DataFrame
with hdfs_client.read('path/to/csv') as reader:
    df = pd.read_csv(reader, sep=';', error_bad_lines=False)
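For CSV files that are too large to fit in memory, the same stream can also be consumed in chunks; here is a minimal sketch reusing the hdfs_client from above, with an arbitrary chunk size of 100,000 rows.
# Process a large CSV file chunk by chunk instead of loading it all at once
with hdfs_client.read('path/to/csv') as reader:
    for chunk in pd.read_csv(reader, sep=';', chunksize=100000):
        print(chunk.shape)  # replace with your own processing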
2.2 Write CSV format into HDFS
Let's start with an example Pandas DataFrame.
After instantiating the HDFS client, use the write() function to write this DataFrame into HDFS in CSV format.
from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# WebHDFS endpoint of the NameNode; the second argument is the mutual authentication mode
hdfs_client = KerberosClient('http://ip_hadoop_master', 'OPTIONAL')

# Stream the DataFrame to HDFS as a CSV file
with hdfs_client.write('/path/to/csv', encoding='utf-8') as writer:
    df.to_csv(writer)
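By default write() refuses to replace an existing file; the hdfs module accepts an overwrite flag, and to_csv() can drop the row index. A small variation of the example above, assuming the same hdfs_client and df:
# Overwrite the target file if it already exists and omit the index column
with hdfs_client.write('/path/to/csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer, index=False)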
3. Parquet format
We will use the PyArrow module to read and write the Parquet file format from a Kerberized HDFS cluster.
3.1 Read a Parquet from HDFS
When reading Parquet data, we will find either a single Parquet file or a set of Parquet blocks under a folder.
3.1.1 Single Parquet file
After instantiating the HDFS client, invoke the read_table() function to read this Parquet file.
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to HDFS; kerb_ticket points to the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=50070, kerb_ticket='/tmp/krb5cc_1000')

# Open the Parquet file and load it into a PyArrow Table
with hdfs.open('path/to/parquet/file', 'rb') as f:
    df = pq.read_table(f)  # call df.to_pandas() if a Pandas DataFrame is needed
3.1.2 Parquet dataset
After instantiating the HDFS client, use the ParquetDataset() function to read these Parquet blocks and convert the loaded table into a Pandas DataFrame.
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to HDFS; kerb_ticket points to the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=50070, kerb_ticket='/tmp/krb5cc_1000')

# Read a directory of Parquet files as a single dataset
dataset = pq.ParquetDataset('path/to/directory', filesystem=hdfs)
table = dataset.read()
df = table.to_pandas()
3.2 Write Parquet format into HDFS
Let's again build an example Pandas DataFrame.
After instantiating the HDFS client, use the write_table() function to write this DataFrame into HDFS in Parquet format.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# Connect to HDFS; kerb_ticket points to the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=50070, kerb_ticket='/tmp/krb5cc_1000')

# Convert the DataFrame into a PyArrow Table and write it to HDFS as Parquet
adf = pa.Table.from_pandas(df)
with hdfs.open('path/to/parquet/file', 'wb') as writer:
    pq.write_table(adf, writer)
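To verify the write, we can read the file back through the same connection; a quick sketch assuming the hdfs connection, path and df from the example above:
# Read the Parquet file back and compare it with the original DataFrame
with hdfs.open('path/to/parquet/file', 'rb') as f:
    df_check = pq.read_table(f).to_pandas()
print(df_check.equals(df))  # True if the round trip preserved the data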
Conclusion
These Python functions are useful when we have to deal with data stored in HDFS, since they let us read and write it directly instead of pulling copies out of HDFS before processing.