ArcGIS Blog

Easily connect and run spatial analysis on new data sources in ArcGIS Pro

By Sarah Ambrose

Imagine running spatial analysis without all the initial time-consuming work to convert and prepare your data, such as:  

  • Combining multiple files into a single file for analysis 
  • Formatting your time and geometry fields before running analysis 
  • Making a table with Longitude and Latitude columns into a feature class using the XY Table to Point tool 

Sounds great, right? With ArcGIS Pro 2.6, big data connections (BDCs) make this a reality. BDCs allow you to connect to collections of datasets without lengthy pre-processing steps to prepare your data for analysis. Big data not required! In this blog, I’m going to give an overview of big data connections, explain how to create them, and show you an example of how to get started with data you can download and use yourself. 

What is a big data connection?

A big data connection is a reference to a folder containing one or more datasets. The BDC contains information about the included datasets, such as the dataset names, the schema, and how time and geometry are represented. The source folder being referenced by the BDC has a folder for each dataset.

Source folder containing three folders. Each of the three folders represents a dataset.

Each dataset can have one or more files. The files in a single dataset folder must be the same file type with the same schema. There can even be folders within the folders – the structure within the dataset folder doesn’t matter! 

Source folder with expanded dataset folders.

When analysis is run using a dataset from a BDC, all the contents in the dataset folder are used in the analysis. For example, when Dataset3 in the image above is selected as an input, the files D3-1 and D3-2 are both used in analysis. 

How do I create a big data connection?

Before creating a BDC, you need to ensure your data is formatted in the correct folder structure: 

  • There is a single source folder. In the above example, this is “Source-Folder.” 
  • The source folder contains one or more dataset folders, one for each dataset. 
  • Files contained within the dataset folder have the same file type and schema. 
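
If you want to sanity-check a source folder before running the tool, the checklist above can be sketched in plain Python. This is only an illustration, not what ArcGIS Pro does internally: the function names are made up for this example, it only compares file extensions (it does not compare schemas), and it assumes single-file formats such as CSV. A multi-file format like a shapefile would need extra handling.

```python
from pathlib import Path


def summarize_source_folder(source_folder):
    """Map each dataset folder name to the set of file extensions found in it.

    Files are collected recursively, because the nesting inside a dataset
    folder does not matter.
    """
    summary = {}
    for dataset_dir in sorted(Path(source_folder).iterdir()):
        if not dataset_dir.is_dir():
            continue  # this sketch ignores loose files in the source folder
        summary[dataset_dir.name] = {
            f.suffix.lower() for f in dataset_dir.rglob("*") if f.is_file()
        }
    return summary


def find_mixed_datasets(summary):
    """Return the names of dataset folders that mix more than one file type."""
    return sorted(name for name, exts in summary.items() if len(exts) > 1)
```

Calling `find_mixed_datasets(summarize_source_folder("Source-Folder"))` would list any dataset folders that need tidying before they are registered.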

Once your data is properly structured, you can run the Create Big Data Connection tool, which will register a reference to your datasets (stored in a .bdc file). In this tool, specify where the output BDC file will reside. I like to store mine in a folder that I have saved as a favorite in my projects. Specify a name for the BDC file and the source location. This tool, like all the other GeoAnalytics Desktop tools, requires an Advanced license. 

When the tool runs, it scans through your datasets to determine the schema and looks for any fields that can be used for time or geometry. When it completes, it creates a .bdc file that stores these and other properties as references to the datasets. The datasets discovered as part of the BDC are left as is; they are not copied or moved in any way. The BDC datasets can be used for visualization and analysis.  

Now what?

Now that you have a BDC file, you can use it in your favorite analytic workflows or add it to the map to visualize your features. 

Example of folder structure for a BDC source folder
BDC source folder, BDC-Example, with a folder representing the dataset Storm-Event-Details. Storm-Event-Details is composed of 5 folders, each containing a CSV.

BDCs provide a powerful new way for Pro to connect and interact with your data. To see how BDCs can fit into your favorite workflows, let’s look at an example. 

Creating and using big data connections: An Example

I’ve downloaded 5 years of storm data from the NOAA Storm Events Database and unzipped the files. I now have a folder called BDC-Example that has a folder named Storm-Event-Details. The Storm-Event-Details folder will represent a single dataset that contains 5 folders, each of which has a CSV. Now, I can register the BDC-Example folder as a BDC, and I have a new dataset called Storm-Event-Details. 

To register the dataset, I run Create Big Data Connection.  

Create big data connection tool
Create Big Data Connection tool. Specify where you want to store your BDC, the name, and the source folder.

This creates a BDC with a dataset that I can add to my map or use in analysis. Before I use this dataset for analysis, I’m going to look at the field values in my dataset. I’ll use this information to ensure that essential information like time and geometry is correctly configured for my BDC dataset.  

To do this, I run the Preview Dataset From Big Data Connection tool. I provide an input BDC dataset and the output is a table in the geoprocessing messages. This is especially useful when using a large dataset that is too big to open.  

Preview Dataset From Big Data Connection
Run the Preview Dataset From Big Data Connection tool to see the first 10 records in your dataset.
Sample of preview values
A sample of some fields in the preview table, including fields BEGIN_LAT and BEGIN_LON which will be used to represent the location, and BEGIN_DATE_TIME which will be used to represent the time.
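
To get a feel for what a preview like this involves, here is a minimal standard-library sketch that returns the first few records from a folder of CSV files without loading everything. It is not the Preview Dataset From Big Data Connection tool, just the same idea in plain Python. The function names are hypothetical, and it assumes UTF-8 CSV files that all share one header row layout.

```python
import csv
from itertools import islice
from pathlib import Path


def iter_dataset_rows(dataset_folder):
    """Yield rows (as dicts) from every CSV under the dataset folder.

    Files are visited in sorted path order, including nested folders.
    """
    for csv_path in sorted(Path(dataset_folder).rglob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


def preview_dataset(dataset_folder, limit=10):
    """Return the first `limit` rows; reading stops once the limit is reached."""
    return list(islice(iter_dataset_rows(dataset_folder), limit))
```

Because the rows are produced lazily, the preview stays quick even when the dataset folder holds far more data than you would want to open at once.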

From this preview I see a few things: 

  • I have a lot of fields (see that scroll bar?), and I’m not interested in using all of them in analysis.  
  • I have a field named BEGIN_DATE_TIME in the format dd-MMM-yy HH:mm:ss. You can see supported time formats here. 
  • I have fields named BEGIN_LAT and BEGIN_LON that can be used to represent the X, Y location of each feature. 

Given the above information, I want to update my big data connection to hide fields I’m not interested in using, and make sure that the time and geometry on this dataset use the fields and formats I’ve outlined above.  

To do this, I use the Update Big Data Connection Dataset Properties tool. This tool allows you to change how a dataset is represented. Once I’ve picked my dataset to update, I’ll make changes in the Fields, Geometry, and Time sections.  

In the Fields section, I uncheck the Show checkbox for fields that don’t have information that I’m interested in using in analysis. I’m only interested in keeping the EPISODE_ID, EVENT_ID, STATE, EVENT_TYPE, BEGIN_DATE_TIME, BEGIN_LAT, BEGIN_LON, and EVENT_NARRATIVE fields, so I uncheck all others. This doesn’t delete anything from the underlying data (no BDC actions will ever delete or modify the existing datasets!), but simply hides those fields from visualization and analysis.  
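
Conceptually, hiding fields behaves like a read-only view over the data rather than an edit. The sketch below shows that idea in plain Python, assuming each record is a dictionary keyed by field name. The helper name is made up for illustration, and this is not how the BDC stores its field settings.

```python
# Fields kept visible in this walkthrough.
VISIBLE_FIELDS = [
    "EPISODE_ID", "EVENT_ID", "STATE", "EVENT_TYPE",
    "BEGIN_DATE_TIME", "BEGIN_LAT", "BEGIN_LON", "EVENT_NARRATIVE",
]


def select_visible(rows, visible_fields=VISIBLE_FIELDS):
    """Yield a new dict per row containing only the visible fields.

    The input rows are never modified, which mirrors the fact that
    hiding a field does not touch the source data.
    """
    for row in rows:
        yield {name: row[name] for name in visible_fields if name in row}
```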

I want to make sure the correct fields are being used to represent geometry. In the Geometry section, I verify that the BEGIN_LAT and BEGIN_LON fields are used for Y, X, and they are – so no changes there.  

The Time section outlines the date time fields and formats to be used in analysis. Here I see that the BDC is using a different field than the one I want to use, so I simply pick the field I’m interested in and specify the format of the input time field, in this case: dd-MMM-yy HH:mm:ss 
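
If the pattern dd-MMM-yy HH:mm:ss is unfamiliar, this small Python sketch may help. It translates the pattern into the equivalent `datetime.strptime` directives and parses a value. The token mapping covers only the six tokens used here, the sample value in the usage note is made up to match the pattern, and month abbreviations are parsed according to the current locale (English is assumed).

```python
import re
from datetime import datetime

# Assumed mapping for just the tokens that appear in dd-MMM-yy HH:mm:ss.
TOKEN_TO_DIRECTIVE = {
    "dd": "%d",   # day of month
    "MMM": "%b",  # abbreviated month name
    "yy": "%y",   # two-digit year
    "HH": "%H",   # hour, 24-hour clock
    "mm": "%M",   # minute
    "ss": "%S",   # second
}
_TOKEN = re.compile("|".join(sorted(TOKEN_TO_DIRECTIVE, key=len, reverse=True)))


def to_strptime_format(pattern):
    """Translate a pattern such as dd-MMM-yy HH:mm:ss into strptime directives."""
    return _TOKEN.sub(lambda match: TOKEN_TO_DIRECTIVE[match.group(0)], pattern)


def parse_time(value, pattern="dd-MMM-yy HH:mm:ss"):
    """Return a datetime, or None when the value does not match the pattern."""
    try:
        return datetime.strptime(value, to_strptime_format(pattern))
    except (ValueError, TypeError):
        return None
```

For example, `parse_time("28-APR-14 14:30:00")` gives a datetime for 28 April 2014 at 14:30, while a value in another layout, such as `"2014-04-28 14:30:00"`, comes back as `None`. A mismatched format is also what produces empty time values when a dataset is registered with the wrong pattern.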

 

Run Update
Running Update Big Data Connection Dataset Properties. In this example I hide three fields for the dataset, verify the geometry formatting is correct, and modify the time fields and formatting for my dataset.

Now that I have made all the changes, I run the Update Big Data Connection Dataset Properties tool and my BDC dataset is updated to use my changes. Want to be sure? Run the Describe Dataset tool. This will produce a summary table of fields, as well as a summary of the geometry and time. To check if time is correct – since it can be difficult to know if your time format is correct – check the time section in the Describe Dataset messages:

Describe Dataset to verify time and geometry are registered correctly
The messages in the Describe Dataset tool show that time is correctly registered on the dataset.

Here we can see that we don’t have any empty time values. If we made a mistake in registration, the time values would show up as empty, and you wouldn’t see a temporal extent.  

Now the data is ready for analysis!

Running Aggregate Points with a BDC dataset and aggregating into time steps.

I chose to run one of my favorite tools, Aggregate Points, for a quick understanding of my data and how it looks over time. Other ideas include analysis from the following blog by Kevin Butler that uses the same data. If you want to copy your data to another source, like a file geodatabase or shapefile, use the Copy Dataset From Big Data Connection tool. 
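
To make the idea of aggregating points into bins and time steps concrete, here is a deliberately simplified Python sketch. It is not the Aggregate Points algorithm: it assumes square bins measured in decimal degrees, one-year time steps, and points given as (longitude, latitude, datetime) tuples. The function name and parameters are hypothetical.

```python
import math
from collections import Counter


def aggregate_points(points, cell_size_deg=1.0):
    """Count points per (column, row, year) bin.

    `points` is an iterable of (lon, lat, when) tuples, where `when`
    is a datetime. Column and row are the indices of the square cell
    that contains the point.
    """
    counts = Counter()
    for lon, lat, when in points:
        col = math.floor(lon / cell_size_deg)
        row = math.floor(lat / cell_size_deg)
        counts[(col, row, when.year)] += 1
    return counts
```

Each key in the result identifies one bin in one time step, so comparing the counts for the same bin across years gives a rough picture of how activity changes over time.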

With big data connections you can save time and resources preparing your dataset for analysis. Instead of merging datasets into a single dataset or converting time or geometry fields, you can use your data as is. Remember, don’t be fooled by the name: big data connections are for datasets of all sizes.  
