Python & HDFS
Read and write data from HDFS using Python
Introduction
Python has a variety of modules which can be used to deal with data, especially when we have to read data from or write data to HDFS.
In this article we will work with two types of flat files: the CSV and Parquet formats.
1. Prerequisite
Note that it is necessary to have the Hadoop client and the libhdfs.so library installed on your machine.
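The exact installation paths depend on your Hadoop distribution; the sketch below uses placeholder paths (adapt them to your machine) to point PyArrow at libhdfs.so through the ARROW_LIBHDFS_DIR variable and to check that the Hadoop client is reachable before connecting.
import os
import shutil

# Placeholder paths: adjust them to where your Hadoop client is installed
os.environ.setdefault('HADOOP_HOME', '/usr/hdp/current/hadoop-client')
os.environ.setdefault('ARROW_LIBHDFS_DIR', '/usr/hdp/current/hadoop-client/lib/native')

# Basic sanity checks before trying to connect to HDFS
if shutil.which('hadoop') is None:
    raise RuntimeError('Hadoop client not found on PATH')
if not os.path.isfile(os.path.join(os.environ['ARROW_LIBHDFS_DIR'], 'libhdfs.so')):
    raise RuntimeError('libhdfs.so not found; set ARROW_LIBHDFS_DIR accordingly')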
2. CSV Format
One of the most popular modules for reading CSV files from a Kerberized HDFS cluster is the hdfs module.
2.1 Read a CSV file from HDFS
After instantiating the HDFS client, invoke the read_csv() function of the Pandas module to load the CSV file.
from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# WebHDFS endpoint of the NameNode; the second argument is the mutual authentication mode
hdfs_client = KerberosClient('http://ip_hadoop_master', 'OPTIONAL')

# Stream the file from HDFS and load it into a DataFrame
with hdfs_client.read('path/to/csv') as reader:
    df = pd.read_csv(reader, sep=';', error_bad_lines=False)
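For CSV files that are too large to fit in memory, the same stream can also be consumed in chunks; here is a minimal sketch reusing the hdfs_client from above, with an arbitrary chunk size of 100,000 rows.
# Process a large CSV file chunk by chunk instead of loading it all at once
with hdfs_client.read('path/to/csv') as reader:
    for chunk in pd.read_csv(reader, sep=';', chunksize=100000):
        print(chunk.shape)  # replace with your own processing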
2.2 Write CSV format into HDFS
Let's start with an example Pandas DataFrame.
After instantiating the HDFS client, use the write() function to write this DataFrame into HDFS in CSV format.
from hdfs.ext.kerberos import KerberosClient
import pandas as pd

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# WebHDFS endpoint of the NameNode; the second argument is the mutual authentication mode
hdfs_client = KerberosClient('http://ip_hadoop_master', 'OPTIONAL')

# Stream the DataFrame to HDFS as a CSV file
with hdfs_client.write('/path/to/csv', encoding='utf-8') as writer:
    df.to_csv(writer)
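By default write() refuses to replace an existing file; the hdfs module accepts an overwrite flag, and to_csv() can drop the row index. A small variation of the example above, assuming the same hdfs_client and df:
# Overwrite the target file if it already exists and omit the index column
with hdfs_client.write('/path/to/csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer, index=False)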
3. Parquet format
We will use the PyArrow module to read and write the Parquet file format from a Kerberized HDFS cluster.
3.1 Read a Parquet from HDFS
When reading Parquet data, we will find either a single Parquet file or a set of Parquet blocks under a folder.
3.1.1 Single Parquet file
After instantiating the HDFS client, invoke the read_table() function to read this Parquet file.
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to HDFS; kerb_ticket points to the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=50070, kerb_ticket='/tmp/krb5cc_1000')

# Open the Parquet file and load it into a PyArrow Table
with hdfs.open('path/to/parquet/file', 'rb') as f:
    df = pq.read_table(f)  # call df.to_pandas() if a Pandas DataFrame is needed
3.1.2 Parquet dataset
After instantiating the HDFS client, use the ParquetDataset() function to read these Parquet blocks and convert the loaded table into a Pandas DataFrame.
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to HDFS; kerb_ticket points to the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=50070, kerb_ticket='/tmp/krb5cc_1000')

# Read a directory of Parquet files as a single dataset
dataset = pq.ParquetDataset('path/to/directory', filesystem=hdfs)
table = dataset.read()
df = table.to_pandas()
3.2 Write Parquet format into HDFS
Let's again build an example Pandas DataFrame.
After instantiating the HDFS client, use the write_table() function to write this DataFrame into HDFS in Parquet format.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example DataFrame
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# Connect to HDFS; kerb_ticket points to the Kerberos ticket cache
hdfs = pa.hdfs.connect('hostname_or_ip_hadoop_master', port=50070, kerb_ticket='/tmp/krb5cc_1000')

# Convert the DataFrame into a PyArrow Table and write it to HDFS as Parquet
adf = pa.Table.from_pandas(df)
with hdfs.open('path/to/parquet/file', 'wb') as writer:
    pq.write_table(adf, writer)
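To verify the write, we can read the file back through the same connection; a quick sketch assuming the hdfs connection, path and df from the example above:
# Read the Parquet file back and compare it with the original DataFrame
with hdfs.open('path/to/parquet/file', 'rb') as f:
    df_check = pq.read_table(f).to_pandas()
print(df_check.equals(df))  # True if the round trip preserved the data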
Conclusion
These Python functions are useful when we have to deal with data stored in HDFS, since they let us read and write it directly instead of pulling copies out of HDFS before processing.