Python & HDFS

Mohammed Tahiri-amine
3 min read · Dec 15, 2020

Read and write data from HDFS using Python

HDFS and Python

Introduction

Python has a variety of modules which can be used to work with data, especially when we have to read from or write to HDFS.

In this article we cover two types of flat files: CSV and Parquet.

1. Prerequisite

Note that it is necessary to have the Hadoop client tools and the libhdfs.so library available on your machine.
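As a minimal sketch of that setup (the paths below are assumptions; adjust them to wherever Hadoop and libhdfs.so live on your machine), the environment can be pointed at the native library and the Hadoop client JARs before connecting:

import os
import subprocess

# Hypothetical install locations: adjust to your own machine
os.environ['HADOOP_HOME'] = '/opt/hadoop'
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'  # directory containing libhdfs.so
# libhdfs also needs the Hadoop client JARs on the CLASSPATH
os.environ['CLASSPATH'] = subprocess.check_output(
    ['/opt/hadoop/bin/hadoop', 'classpath', '--glob']
).decode().strip()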

2. CSV Format

One of the most popular modules for reading a CSV file from a Kerberized HDFS cluster is the hdfs module.

2.1 Read a CSV file from HDFS

After instantiating the HDFS client, open the file with read() and pass the resulting reader to Pandas' read_csv() function to load the CSV file.

from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# WebHDFS endpoint of the NameNode (50070 is the usual default, 9870 on Hadoop 3);
# 'OPTIONAL' is the mutual-authentication mode
hdfs_client = KerberosClient('http://ip_hadoop_master:50070', 'OPTIONAL')
with hdfs_client.read('path/to/csv') as reader:
    df = pd.read_csv(reader, sep=';', error_bad_lines=False)
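Since read() streams the file, a large CSV does not have to fit in memory in one go. A hedged variant using Pandas' chunksize parameter (process() is a hypothetical placeholder for your own handling):

with hdfs_client.read('path/to/csv') as reader:
    # Iterate over the file 100,000 rows at a time instead of loading it whole
    for chunk in pd.read_csv(reader, sep=';', chunksize=100000):
        process(chunk)  # hypothetical handler, not part of any library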

2.2 Write CSV format into HDFS

Let’s start from an example Pandas DataFrame.

After instantiating the HDFS client, use the write() function to write this Pandas DataFrame into HDFS in CSV format.

from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

hdfs_client = KerberosClient('http://ip_hadoop_master:50070', 'OPTIONAL')
with hdfs_client.write('/path/to/csv', encoding='utf-8') as writer:
    df.to_csv(writer)
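By default the hdfs client refuses to replace a file that already exists. If re-runs should overwrite the target, write() accepts an overwrite flag; a small sketch reusing the client above:

# overwrite=True replaces the file if it already exists on HDFS;
# index=False keeps the DataFrame index out of the CSV
with hdfs_client.write('/path/to/csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer, index=False)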

3. Parquet format

We will use the PyArrow module to read and write Parquet files on a Kerberized HDFS cluster.

3.1 Read a Parquet file from HDFS

When reading Parquet data, we will encounter either a single Parquet file or a set of Parquet files under a directory.

3.1.1 Single Parquet file

After instantiating the HDFS client, invoke the read_table() function to read the Parquet file, then convert the resulting table into a Pandas DataFrame.

import pyarrow as pa
import pyarrow.parquet as pq

# Connect through libhdfs to the NameNode (RPC port, commonly 8020);
# kerb_ticket points at the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=8020, kerb_ticket='/tmp/krb5cc_1000')
with hdfs.open('path/to/parquet/file', 'rb') as f:
    table = pq.read_table(f)
df = table.to_pandas()
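Because Parquet is a columnar format, read_table() can also fetch just the columns we need; a sketch reusing the connection above (the column name 'hello' is borrowed from the example DataFrame in section 3.2):

with hdfs.open('path/to/parquet/file', 'rb') as f:
    # Only the requested columns are read from disk
    table = pq.read_table(f, columns=['hello'])
df = table.to_pandas()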

3.1.2 Parquet dataset

After instantiating the HDFS client, use the ParquetDataset() class to read the set of Parquet files and convert the loaded table into a Pandas DataFrame.

import pyarrow as pa
import pyarrow.parquet as pq

hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=8020, kerb_ticket='/tmp/krb5cc_1000')

# Read a directory of Parquet files as one dataset
dataset = pq.ParquetDataset('path/to/directory', filesystem=hdfs)
table = dataset.read()
df = table.to_pandas()
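On recent PyArrow versions the same read can be written as a single call, passing the directory path and filesystem straight to read_table(); a hedged equivalent rather than something the original code relies on:

# read_table() also accepts a directory and reads every Parquet file under it
df = pq.read_table('path/to/directory', filesystem=hdfs).to_pandas()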

3.2 Write Parquet format into HDFS

Again, let’s start from an example Pandas DataFrame.

After instantiating the HDFS client, convert the DataFrame to a PyArrow Table and use the write_table() function to write it into HDFS in Parquet format.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=8020, kerb_ticket='/tmp/krb5cc_1000')

# Convert the DataFrame to an Arrow Table, then write it out as Parquet
adf = pa.Table.from_pandas(df)
with hdfs.open('path/to/parquet/file', 'wb') as writer:
    pq.write_table(adf, writer)
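write_table() also takes a compression codec; a small sketch, with Snappy (the common default) spelled out explicitly:

with hdfs.open('path/to/parquet/file', 'wb') as writer:
    # Snappy trades a little file size for fast compression and decompression
    pq.write_table(adf, writer, compression='snappy')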

Conclusion

These Python modules are useful when we have to work with data stored in HDFS, since they let us read and write it in place rather than copying it out of HDFS before processing.
