Connect to dataframe data
A dataframe is a set of data that resides in memory and is represented in your code by the variable to which it is assigned. To connect to this in-memory data you will define a Data Source based on the type of dataframe you are connecting to, a Data Asset that connects to the dataframe in question, and a Batch Definition that will return all of the records in the dataframe as a single Batch of data.
Create a Data Source
Because dataframes reside in memory, you do not need to specify the location of the data when you create your Data Source. Instead, the type of Data Source you create depends on the type of dataframe containing your data. Great Expectations has methods for connecting to both pandas and Spark dataframes.
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
Procedure
Instructions
- Define the Data Source parameters. A dataframe Data Source requires the following information:
  - name: A name by which to reference the Data Source. This should be unique among all Data Sources on the Data Context.
  Update data_source_name in the following code with a descriptive name for your Data Source:
      data_source_name = "my_data_source"
- Create the Data Source. To read a pandas dataframe you will need to create a pandas Data Source. Likewise, to read a Spark dataframe you will need to create a Spark Data Source.
  pandas: Execute the following code to create a pandas Data Source:
      data_source = context.data_sources.add_pandas(name=data_source_name)
  Spark: Execute the following code to create a Spark Data Source:
      data_source = context.data_sources.add_spark(name=data_source_name)
Sample code
pandas:

import great_expectations as gx

# Retrieve your Data Context
context = gx.get_context()

# Define the Data Source name
data_source_name = "my_data_source"

# Add the Data Source to the Data Context
data_source = context.data_sources.add_pandas(name=data_source_name)

Spark:

import great_expectations as gx

# Retrieve your Data Context
context = gx.get_context()

# Define the Data Source name
data_source_name = "my_data_source"

# Add the Data Source to the Data Context
data_source = context.data_sources.add_spark(name=data_source_name)
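Because the choice between add_pandas() and add_spark() follows from the dataframe's type, it can also be made programmatically. The helper below is a minimal sketch, not part of GX: the name add_dataframe_data_source is hypothetical, and identifying the dataframe library from the top-level package of its class is an assumption that should hold for the standard pandas and PySpark dataframe classes.

```python
def add_dataframe_data_source(context, name, dataframe):
    """Create a pandas or Spark Data Source to match the dataframe's type.

    Hypothetical helper, not a GX API. It only calls the add_pandas() and
    add_spark() methods shown above.
    """
    # Assumption: the top-level package of the dataframe's class identifies
    # the library, e.g. "pandas.core.frame" or "pyspark.sql.dataframe".
    library = type(dataframe).__module__.split(".")[0]
    if library == "pandas":
        return context.data_sources.add_pandas(name=name)
    if library == "pyspark":
        return context.data_sources.add_spark(name=name)
    raise TypeError(f"Unsupported dataframe type: {type(dataframe).__name__}")
```

With a Data Context in context and a dataframe in dataframe, this could be called as add_dataframe_data_source(context, "my_data_source", dataframe).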
Create a Data Asset
A dataframe Data Asset is used to group your Validation Results. For instance, if you have a data pipeline with three stages and want the Validation Results for each stage to be grouped together, you would create one Data Asset per stage, each with a unique name representing that stage.
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- A pandas or Spark dataframe Data Source.
Procedure
Instructions
- Optional. Retrieve your Data Source. If you do not already have a variable referencing your pandas or Spark Data Source, you can retrieve a previously created one with:
      data_source_name = "my_data_source"
      data_source = context.data_sources.get(data_source_name)
- Define the Data Asset's parameters. A dataframe Data Asset requires the following information:
  - name: A name by which the Data Asset can be referenced. This should be unique among Data Assets on the Data Source.
  Update the data_asset_name parameter in the following code with a descriptive name for your Data Asset:
      data_asset_name = "my_dataframe_data_asset"
- Add a Data Asset to the Data Source. Execute the following code to add a Data Asset to your Data Source:
      data_asset = data_source.add_dataframe_asset(name=data_asset_name)
Sample code

import great_expectations as gx

# Retrieve your Data Context
context = gx.get_context()

# Retrieve the Data Source
data_source_name = "my_data_source"
data_source = context.data_sources.get(data_source_name)

# Define the Data Asset name
data_asset_name = "my_dataframe_data_asset"

# Add a Data Asset to the Data Source
data_asset = data_source.add_dataframe_asset(name=data_asset_name)
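The pipeline-stage grouping described at the start of this section could be set up with a small loop. The helper below is a sketch under stated assumptions: add_stage_assets is a hypothetical name, the stage names are whatever you choose, and the only GX call it makes is add_dataframe_asset().

```python
def add_stage_assets(data_source, stage_names):
    """Add one dataframe Data Asset per pipeline stage.

    Hypothetical helper, not a GX API. Returns a dict mapping each stage
    name to the Data Asset created for it.
    """
    if len(set(stage_names)) != len(stage_names):
        # Data Asset names should be unique on the Data Source.
        raise ValueError("Stage names must be unique")
    return {
        stage: data_source.add_dataframe_asset(name=stage)
        for stage in stage_names
    }
```

For example, add_stage_assets(data_source, ["ingest", "transform", "publish"]) would create three Data Assets, using illustrative stage names.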
Create a Batch Definition
Typically, a Batch Definition is used to describe how the data within a Data Asset should be retrieved. With dataframes, all of the data in a given dataframe will always be retrieved as a Batch.
This means that Batch Definitions for dataframe Data Assets do not subdivide the data returned for validation. Instead, they serve as an additional layer of organization and allow you to further group your Validation Results. For example, if you have already used your dataframe Data Assets to group your Validation Results by pipeline stage, you could use two Batch Definitions to further group those results: one used by all automated validations and the other by all manually executed validations.
If you use GX Cloud and GX Core together, note that Batch Definitions you create with the API apply to API-managed Expectations only.
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- A pandas or Spark dataframe Data Asset.
Procedure
Instructions
- Optional. Retrieve your Data Asset. If you do not already have a variable referencing your pandas or Spark Data Asset, you can retrieve a previously created Data Asset with:
      data_source_name = "my_data_source"
      data_asset_name = "my_dataframe_data_asset"
      data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
- Define the Batch Definition's parameters. A dataframe Batch Definition requires the following information:
  - name: A name by which the Batch Definition can be referenced. This should be unique among Batch Definitions on the Data Asset.
  Because dataframes are always provided in their entirety, dataframe Batch Definitions always use the add_batch_definition_whole_dataframe() method.
  Update the value of batch_definition_name in the following code with something that describes your dataframe:
      batch_definition_name = "my_batch_definition"
- Add the Batch Definition to the Data Asset. Execute the following code to add a Batch Definition to your Data Asset:
      batch_definition = data_asset.add_batch_definition_whole_dataframe(
          batch_definition_name
      )
Sample code

import great_expectations as gx

# Retrieve your Data Context
context = gx.get_context()

# Retrieve the Data Asset
data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

# Define the Batch Definition name
batch_definition_name = "my_batch_definition"

# Add a Batch Definition to the Data Asset
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)
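The automated-versus-manual grouping described at the start of this section might look like the following. This is a sketch only: the helper name and the two Batch Definition names are illustrative assumptions, and the only GX call used is add_batch_definition_whole_dataframe().

```python
def add_run_mode_batch_definitions(data_asset):
    """Add one whole-dataframe Batch Definition per way of running validations.

    Hypothetical helper, not a GX API. Returns a dict mapping each Batch
    Definition name to the Batch Definition created for it.
    """
    names = ("automated_validations", "manual_validations")
    # Both definitions return the entire dataframe. They differ only in name,
    # which is what allows Validation Results to be grouped separately.
    return {
        name: data_asset.add_batch_definition_whole_dataframe(name)
        for name in names
    }
```

Automated jobs would then retrieve the "automated_validations" Batch Definition and interactive sessions the "manual_validations" one, both on the same Data Asset.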
Provide a dataframe through Batch Parameters
Because dataframes exist in memory and cease to exist when a Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime in a Batch Parameter dictionary.
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- A Batch Definition on a pandas or Spark dataframe Data Asset.
- Data in a pandas or Spark dataframe. These examples assume the variable dataframe contains your pandas or Spark dataframe.
- Optional. A Validation Definition.
Procedure
- Define the Batch Parameter dictionary. A dataframe can be added to a Batch Parameter dictionary by defining it as the value of the dictionary key dataframe:
      batch_parameters = {"dataframe": dataframe}
  The following examples create a dataframe by reading a .csv file and storing it in a Batch Parameter dictionary:
  pandas:
      import pandas

      csv_path = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
      dataframe = pandas.read_csv(csv_path)

      batch_parameters = {"dataframe": dataframe}
  Spark:
      from pyspark.sql import SparkSession

      csv = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
      spark = SparkSession.builder.appName("Read CSV").getOrCreate()
      dataframe = spark.read.csv(csv, header=True, inferSchema=True)

      batch_parameters = {"dataframe": dataframe}
- Pass the Batch Parameter dictionary to a get_batch() or run() method call. Runtime Batch Parameters can be provided to the get_batch() method of a Batch Definition or to the run() method of a Validation Definition.
  Batch Definition: The get_batch() method of a Batch Definition retrieves a single Batch of data. Runtime Batch Parameters can be provided to the get_batch() method to specify the data returned as a Batch. The validate() method of this Batch can then be used to test individual Expectations.
      import great_expectations as gx

      context = gx.get_context()

      # Retrieve the dataframe Batch Definition
      data_source_name = "my_data_source"
      data_asset_name = "my_dataframe_data_asset"
      batch_definition_name = "my_batch_definition"
      batch_definition = (
          context.data_sources.get(data_source_name)
          .get_asset(data_asset_name)
          .get_batch_definition(batch_definition_name)
      )

      # Create an Expectation to test
      expectation = gx.expectations.ExpectColumnValuesToBeBetween(
          column="passenger_count", max_value=6, min_value=1
      )

      # Get the dataframe as a Batch
      batch = batch_definition.get_batch(batch_parameters=batch_parameters)

      # Test the Expectation
      validation_results = batch.validate(expectation)
      print(validation_results)
  The results generated by batch.validate() are not persisted in storage. This workflow is solely intended for interactively creating Expectations and engaging in data exploration.
  For further information on using an individual Batch to test Expectations see Test an Expectation.
  Validation Definition: A Validation Definition's run() method validates an Expectation Suite against a Batch returned by a Batch Definition. Runtime Batch Parameters can be provided to a Validation Definition's run() method to specify the data returned in the Batch. This allows you to validate your dataframe by executing the Expectation Suite included in the Validation Definition.
      import great_expectations as gx

      context = gx.get_context()

      # Retrieve a Validation Definition that uses the dataframe Batch Definition
      validation_definition_name = "my_validation_definition"
      validation_definition = context.validation_definitions.get(validation_definition_name)

      # Validate the dataframe by passing it to the Validation Definition as Batch Parameters.
      validation_results = validation_definition.run(batch_parameters=batch_parameters)
      print(validation_results)
  For more information on Validation Definitions see Run Validations.
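Because the dataframe is supplied at runtime, one Batch Definition can be reused for several dataframes in the same session. The helper below is a minimal sketch of that pattern using only the get_batch() and Batch validate() calls shown above. The function name and the dict of labeled dataframes are hypothetical, and it assumes each result exposes a boolean success attribute.

```python
def validate_dataframes(batch_definition, expectation, dataframes):
    """Test one Expectation against several dataframes.

    Hypothetical helper, not a GX API. dataframes maps a label of your
    choosing to a pandas or Spark dataframe. Returns a dict mapping each
    label to True (passed) or False (failed).
    """
    outcomes = {}
    for label, dataframe in dataframes.items():
        # The dataframe is not stored on the Batch Definition; it is passed
        # in fresh with every call.
        batch = batch_definition.get_batch(
            batch_parameters={"dataframe": dataframe}
        )
        result = batch.validate(expectation)
        # Assumption: the result object has a boolean "success" attribute.
        outcomes[label] = bool(result.success)
    return outcomes
```

As with any call to batch.validate(), results obtained this way are not persisted, so the pattern suits interactive checks rather than recorded validation runs.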