adlfs 2024.7.0
Access Azure Datalake Gen1 with fsspec and dask
Requires Python
>=3.8
Dependencies
- azure-core <2.0.0,>=1.23.1
- azure-datalake-store <0.1,>=0.0.46
- azure-identity
- azure-storage-blob >=12.12.0
- fsspec >=2023.12.0
- aiohttp >=3.7.0
- sphinx; extra == "docs"
- myst-parser; extra == "docs"
- furo; extra == "docs"
- numpydoc; extra == "docs"
- pytest; extra == "tests"
- docker; extra == "tests"
- pytest-mock; extra == "tests"
- arrow; extra == "tests"
- dask[dataframe]; extra == "tests"
Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage
Quickstart
This package can be installed using:
pip install adlfs
or
conda install -c conda-forge adlfs
The adl:// and abfs:// protocols are included in fsspec's known_implementations registry in fsspec > 0.6.1; otherwise users must explicitly inform fsspec about the supported adlfs protocols.
To use the Gen1 filesystem:
import dask.dataframe as dd
storage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}
dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)
To use the Gen2 filesystem you can use the protocol abfs or az:
import dask.dataframe as dd
storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)
Accepted protocol / uri formats include:
'PROTOCOL://container/path-part/file'
'PROTOCOL://container@account.dfs.core.windows.net/path-part/file'
Alternatively, if AZURE_STORAGE_ACCOUNT_NAME and an AZURE_STORAGE_<CREDENTIAL> are set as environment variables, storage_options will be read from those environment variables.
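As an illustration of the two URI shapes listed above, the following minimal sketch splits such a URI into its parts using only the Python standard library. The helper name `split_azure_uri` and the returned dictionary keys are hypothetical and chosen for this example; this is not how adlfs parses paths internally.

```python
from urllib.parse import urlsplit


def split_azure_uri(uri):
    """Split 'PROTOCOL://container[@account.host]/path' into its parts.

    Hypothetical helper for illustration only, not part of adlfs.
    """
    parts = urlsplit(uri)
    # The netloc is either 'container' or 'container@account.<host suffix>'.
    container, sep, host = parts.netloc.partition("@")
    # Assumption: the account name is the first label of the host name.
    account_name = host.split(".", 1)[0] if sep else None
    return {
        "protocol": parts.scheme,
        "container": container,
        "account_name": account_name,
        "path": parts.path.lstrip("/"),
    }
```

With the short form, `account_name` comes back as `None`, which matches the text: the account then has to be supplied through storage_options or the environment.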
To read from a public storage blob you are required to specify the 'account_name'.
For example, you can access the NYC Taxi & Limousine Commission data as:
storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)
Details
The package includes pythonic filesystem implementations for both Azure Datalake Gen1 and Azure Datalake Gen2 that facilitate interactions between both Azure Datalake implementations and Dask. This is done by leveraging the intake/filesystem_spec base class and the Azure Python SDKs.
Operations against the Gen1 Datalake currently only work with an Azure ServicePrincipal with suitable credentials to perform operations on the resources of choice.
Operations against the Gen2 Datalake are implemented by leveraging the Azure Blob Storage Python SDK.
Setting credentials
The storage_options can be instantiated with a variety of keyword arguments depending on the filesystem. The most commonly used arguments are:
- connection_string
- account_name
- account_key
- sas_token
- tenant_id, client_id, and client_secret are combined for an Azure ServicePrincipal, e.g. storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}
- anon: bool, optional. Whether to attempt anonymous access if no other credential is passed. By default (None), the AZURE_STORAGE_ANON environment variable is checked. False values (false, 0, f) resolve to False and anonymous access will not be attempted. Otherwise the value for anon resolves to True.
- location_mode: valid values are "primary" or "secondary" and apply to RA-GRS accounts
For more argument details see all arguments for AzureBlobFileSystem here and AzureDatalakeFileSystem here.
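The rule described for the anon argument can be sketched in a few lines. This is a reading aid under stated assumptions, not the adlfs implementation: the function name `resolve_anon` is hypothetical, and the case-insensitive comparison is an assumption that the text does not confirm.

```python
import os

# False-like strings named in the description of `anon`.
FALSE_VALUES = {"false", "0", "f"}


def resolve_anon(anon=None, environ=None):
    """Return the effective anonymous-access flag (hypothetical helper)."""
    if anon is not None:
        # An explicitly passed value takes effect directly.
        return bool(anon)
    environ = os.environ if environ is None else environ
    raw = environ.get("AZURE_STORAGE_ANON")
    if raw is None:
        # Nothing passed and nothing set: anonymous access is attempted.
        return True
    # Assumption: the comparison ignores case and surrounding whitespace.
    return raw.strip().lower() not in FALSE_VALUES
```

Passing a plain dict as `environ` makes the rule easy to try out without touching the real process environment.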
The following environment variables can also be set and picked up for authentication:
- "AZURE_STORAGE_CONNECTION_STRING"
- "AZURE_STORAGE_ACCOUNT_NAME"
- "AZURE_STORAGE_ACCOUNT_KEY"
- "AZURE_STORAGE_SAS_TOKEN"
- "AZURE_STORAGE_TENANT_ID"
- "AZURE_STORAGE_CLIENT_ID"
- "AZURE_STORAGE_CLIENT_SECRET"
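To make the correspondence between these variables and the storage_options keys concrete, here is a minimal sketch that collects them into a dictionary. adlfs picks the variables up on its own, so this helper is not needed in practice. The name `storage_options_from_env` is hypothetical, and the mapping (drop the AZURE_STORAGE_ prefix, lowercase the rest) is an assumption based on the argument names listed earlier.

```python
import os

ENV_PREFIX = "AZURE_STORAGE_"
ENV_SUFFIXES = (
    "CONNECTION_STRING",
    "ACCOUNT_NAME",
    "ACCOUNT_KEY",
    "SAS_TOKEN",
    "TENANT_ID",
    "CLIENT_ID",
    "CLIENT_SECRET",
)


def storage_options_from_env(environ=None):
    """Build a storage_options-style dict from the variables above.

    Hypothetical helper for illustration only, not part of adlfs.
    """
    environ = os.environ if environ is None else environ
    options = {}
    for suffix in ENV_SUFFIXES:
        value = environ.get(ENV_PREFIX + suffix)
        if value:
            # e.g. AZURE_STORAGE_ACCOUNT_KEY -> 'account_key'
            options[suffix.lower()] = value
    return options
```

The resulting dictionary has the same shape as the storage_options passed to dask in the Quickstart examples.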
The filesystem can be instantiated for different use cases based on a variety of storage_options combinations. The following list describes some common use cases utilizing AzureBlobFileSystem, i.e. protocols abfs or az. Note that all cases require the account_name argument to be provided:
- Anonymous connection to public container: storage_options={'account_name': ACCOUNT_NAME, 'anon': True} will assume the ACCOUNT_NAME points to a public container, and attempt to use an anonymous login. Note, the default value for anon is True.
- Auto credential solving using Azure's DefaultAzureCredential() library: storage_options={'account_name': ACCOUNT_NAME, 'anon': False} will use DefaultAzureCredential to get valid credentials to the container ACCOUNT_NAME. DefaultAzureCredential attempts to authenticate via the mechanisms and order visualized here.
- Auto credential solving without requiring storage_options: Set AZURE_STORAGE_ANON to false, resulting in automatic credential resolution. Useful for compatibility with fsspec.
- Azure ServicePrincipal: tenant_id, client_id, and client_secret are all used as credentials for an Azure ServicePrincipal, e.g. storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}.
Append Blob
The AzureBlobFileSystem accepts all of the Async BlobServiceClient arguments.
By default, write operations create BlockBlobs in Azure, which, once written, cannot be appended to. It is possible to create an AppendBlob by using mode="ab" when creating and operating on blobs. Currently, AppendBlobs are not available if hierarchical namespaces are enabled.
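As a minimal sketch of the mode="ab" usage described above, the helper below appends one line of text to a blob. The function name `append_line` is hypothetical, and `fs` is assumed to be an already constructed AzureBlobFileSystem (any object with a compatible `open` method works). The account is assumed not to have hierarchical namespaces enabled.

```python
def append_line(fs, path, line):
    """Append one text line to the blob at `path` (hypothetical helper).

    `fs` is assumed to be an AzureBlobFileSystem instance; opening with
    mode="ab" is what requests an AppendBlob instead of a BlockBlob.
    """
    with fs.open(path, mode="ab") as f:
        f.write((line + "\n").encode("utf-8"))
```

Keeping the filesystem object as a parameter leaves credential handling to the caller, so the same helper works with any of the storage_options combinations listed earlier.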