Python & HDFS

Mohammed Tahiri-amine
3 min read · Dec 15, 2020

Read and write data from HDFS using Python

HDFS and Python

Introduction

Python has a variety of modules which can be used to work with data, especially when we have to read from or write to HDFS.

In this article we cover two types of flat files: CSV and Parquet.

1. Prerequisite

Note that it is necessary to have the Hadoop client tools and the libhdfs.so library available on your machine.
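As a minimal sketch of that setup (the paths below are assumptions; adjust them to wherever Hadoop and libhdfs.so live on your machine), the environment can be pointed at the native library and the Hadoop client JARs before connecting:

import os
import subprocess

# Hypothetical install locations: adjust to your own machine
os.environ['HADOOP_HOME'] = '/opt/hadoop'
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'  # directory containing libhdfs.so
# libhdfs also needs the Hadoop client JARs on the CLASSPATH
os.environ['CLASSPATH'] = subprocess.check_output(
    ['/opt/hadoop/bin/hadoop', 'classpath', '--glob']
).decode().strip()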

2. CSV Format

One of the most popular modules for reading a CSV file from a Kerberized HDFS cluster is the hdfs module.

2.1 Read a CSV file from HDFS

After instantiating the HDFS client, open the file with read() and pass the resulting reader to Pandas' read_csv() function to load the CSV file.

from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# WebHDFS endpoint of the NameNode (50070 is the usual default, 9870 on Hadoop 3);
# 'OPTIONAL' is the mutual-authentication mode
hdfs_client = KerberosClient('http://ip_hadoop_master:50070', 'OPTIONAL')
with hdfs_client.read('path/to/csv') as reader:
    df = pd.read_csv(reader, sep=';', error_bad_lines=False)
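Since read() streams the file, a large CSV does not have to fit in memory in one go. A hedged variant using Pandas' chunksize parameter (process() is a hypothetical placeholder for your own handling):

with hdfs_client.read('path/to/csv') as reader:
    # Iterate over the file 100,000 rows at a time instead of loading it whole
    for chunk in pd.read_csv(reader, sep=';', chunksize=100000):
        process(chunk)  # hypothetical handler, not part of any library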

2.2 Write CSV format into HDFS

Let’s start from an example Pandas DataFrame.

After instantiating the HDFS client, use the write() function to write this Pandas DataFrame into HDFS in CSV format.

from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

hdfs_client = KerberosClient('http://ip_hadoop_master:50070', 'OPTIONAL')
with hdfs_client.write('/path/to/csv', encoding='utf-8') as writer:
    df.to_csv(writer)
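By default the hdfs client refuses to replace a file that already exists. If re-runs should overwrite the target, write() accepts an overwrite flag; a small sketch reusing the client above:

# overwrite=True replaces the file if it already exists on HDFS;
# index=False keeps the DataFrame index out of the CSV
with hdfs_client.write('/path/to/csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer, index=False)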

3. Parquet format

We will use the PyArrow module to read and write Parquet files on a Kerberized HDFS cluster.

3.1 Read a Parquet file from HDFS

When reading Parquet data, we will encounter either a single Parquet file or a set of Parquet files under a directory.

3.1.1 Single Parquet file

After instantiating the HDFS client, invoke the read_table() function to read the Parquet file, then convert the resulting table into a Pandas DataFrame.

import pyarrow as pa
import pyarrow.parquet as pq

# Connect through libhdfs to the NameNode (RPC port, commonly 8020);
# kerb_ticket points at the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=8020, kerb_ticket='/tmp/krb5cc_1000')
with hdfs.open('path/to/parquet/file', 'rb') as f:
    table = pq.read_table(f)
df = table.to_pandas()
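Because Parquet is a columnar format, read_table() can also fetch just the columns we need; a sketch reusing the connection above (the column name 'hello' is borrowed from the example DataFrame in section 3.2):

with hdfs.open('path/to/parquet/file', 'rb') as f:
    # Only the requested columns are read from disk
    table = pq.read_table(f, columns=['hello'])
df = table.to_pandas()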

3.1.2 Parquet dataset

After instantiating the HDFS client, use the ParquetDataset() class to read the set of Parquet files and convert the loaded table into a Pandas DataFrame.

import pyarrow as pa
import pyarrow.parquet as pq

hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=8020, kerb_ticket='/tmp/krb5cc_1000')

# Read a directory of Parquet files as one dataset
dataset = pq.ParquetDataset('path/to/directory', filesystem=hdfs)
table = dataset.read()
df = table.to_pandas()
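On recent PyArrow versions the same read can be written as a single call, passing the directory path and filesystem straight to read_table(); a hedged equivalent rather than something the original code relies on:

# read_table() also accepts a directory and reads every Parquet file under it
df = pq.read_table('path/to/directory', filesystem=hdfs).to_pandas()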

3.2 Write Parquet format into HDFS

Again, let’s start from an example Pandas DataFrame.

After instantiating the HDFS client, convert the DataFrame to a PyArrow Table and use the write_table() function to write it into HDFS in Parquet format.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=8020, kerb_ticket='/tmp/krb5cc_1000')

# Convert the DataFrame to an Arrow Table, then write it out as Parquet
adf = pa.Table.from_pandas(df)
with hdfs.open('path/to/parquet/file', 'wb') as writer:
    pq.write_table(adf, writer)
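write_table() also takes a compression codec; a small sketch, with Snappy (the common default) spelled out explicitly:

with hdfs.open('path/to/parquet/file', 'wb') as writer:
    # Snappy trades a little file size for fast compression and decompression
    pq.write_table(adf, writer, compression='snappy')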

Conclusion

These Python modules are useful when we have to work with data stored in HDFS, since they let us read and write it in place rather than copying it out of HDFS before processing.
