So many Shape Files. Too much time to load them!
James Fee, in his Spatially Adjusted blog, recently remarked about ESRI Shape Files, “So antiquated, so limiting, so dangerous….” True. True. Yet I could also add, “so everywhere.” Love them or hate them, Shapefiles are everywhere. Therefore, they are probably here to stay, at least until we all retire to some tropical paradise.
At least in the US, just about everyone I know obtains Shapefiles from various Federal agencies, either directly from sites like Census.gov, or clearinghouses like Data.gov. Features like Census Blocks and Block Groups, Congressional Districts, Places, Townships, and Counties come to mind immediately as commonly used datasets. In the USA, because of our Federal system, much of the public data will be organized as individual state level files instead of one big US-wide file like you would normally expect from private suppliers. That means you will need to manage up to 56 files: one for each of the 50 states; one for the District of Columbia, and one each for the territories and possessions of Puerto Rico, Guam, Northern Mariana Islands, US Virgin Islands, and American Samoa.
Often in my line of work, we want the data in a Microsoft SQL Server database, with all states and territories in single database table. If you are used to loading Shapefiles into a Microsoft SQL Server table, then likely you are familiar with the Shape2SQL utility (unless you happen to have the budget for a commercial grade loader). With Shape2SQL, you can quickly and easily load a single a Shapefile into a database using its Windows dialog. As it happens, you can also run Shape2SQL in Command Line mode. Therefore, you can create an old fashioned Windows/DOS BAT file of multiple command lines, one per Shapefile to load.
We can look at an example using “Place” files (for the USA). “Places” are boundaries of features like cities and towns, or commonly recognized but unincorporated communities. Organizations often use them in point-in-polygon operations, to determine the enclosing city of an address. (You cannot use the city name in the last line of the mailing address for this purpose because that value reflects the Post Office that delivers the mail and not the actual city in which the address is located.)
In the BAT file, if you are loading all 56 state files, you’ll have 56 command lines that mostly look like the following string of text:
“c:\Program Files (x86)\Shape2SQL\shape2sql.exe” -shp=”Y:\\SkyDrive\\SSIS Source\\Census Tiger\\Place 2013\\tl_2013_01_place.shp” -connstr=”Data Source=MyDatabaseServer;Initial Catalog=Project; Uid=username;Pwd=password;” - oidname=”RowID” -append=”true” -table=”USPLC2013”
It’s ugly, right. But then BAT files aren’t normally pretty. So let’s break down my example down in more detail.
Documentation on the Command Line option is limited to what you see from running Shape2SQL.exe/Help.
As you might imagine, the Command Line option is not a very forgiving process. Errors are difficult to detect, save for an empty destination table at the end of a run, or are hard to decipher. Here are a few tips, based on my experience with this approach.
First, start with a temporary destination table in your Microsoft SQL Server 2008+ database. In my example, the destination table is named “USPLC2013” as in “US Places, 2013.” Be sure to include all columns in the dBase file that accompanies the Shapefile. Also be sure that the data types, especially those with string content, are correctly sized. Otherwise, the BAT file could bomb due to a truncation error, except you will not necessarily know that from the feedback. Also, unless you really know your data, allow all the columns to be nullable.
Create the table in the “dbo” schema. (As far as I have determined you do not have any other options.)
Comparing my BAT file to the options available per the Help screen, you can see that I skipped a few options. By default, the Shape2SQL utility assumes the spatial data will be loaded into column named “geom” and defined with geometry data type. So that’s what I used in my table DDL, thereby allowing me to omit three options: ”geomname,” “srid,” and “geography.”
Include a row number like column using an “identify” data type. No need to constrain it as a Primary Key. I often use RowID as the column name for simplicity.
If you are loading huge amounts of records and files, be sure on your computer/VM to turn off settings that might put it to sleep in-mid processing. If you are loading billions of records via batch file processes, expect the processing to go on for days. (Since the Shape2SQL executable is lightweight, you could consider simultaneously queuing up multiple batch files targeted at multiple work tables.)
The DDL for my destination table looked like so:
CREATE TABLE [dbo].[USPLC2013]
[RowID] [int] IDENTITY(1,1) NOT NULL,
[STATEFP] [nvarchar](255) NULL,
[PLACEFP] [nvarchar](255) NULL,
[PLACENS] [nvarchar](255) NULL,
[GEOID] [nvarchar](255) NULL,
[NAME] [nvarchar](255) NULL,
[NAMELSAD] [nvarchar](255) NULL,
[LSAD] [nvarchar](255) NULL,
[CLASSFP] [nvarchar](255) NULL,
[PCICBSA] [nvarchar](255) NULL,
[PCINECTA] [nvarchar](255) NULL,
[MTFCC] [nvarchar](255) NULL,
[FUNCSTAT] [nvarchar](255) NULL,
[ALAND] [bigint] NULL,
[AWATER] [bigint] NULL,
[INTPTLAT] [nvarchar](255) NULL,
[INTPTLON] [nvarchar](255) NULL,
[geom] [geometry] NULL
ON [Secondary] TEXTIMAGE_ON [Secondary]
After you’ve imported all 56 files into the “work” table, then it’s simple enough to restructure, rename, transform the data into a production ready table.
The DDL for my final, “production” ready table looked like so:
CREATE TABLE [Census].[PlaceGeography]
[Feature ID] [int] NOT NULL,
[State FIPS] varchar(2) NOT NULL,
[Place FIPS] [varchar](5) NULL,
[ST] [varchar](2) NULL,
[Place Name] [varchar](100) NOT NULL,
[Legal Place Name] [varchar](120) NOT NULL,
[Territory Indicator] [bit] NOT NULL,
[Class FIPS Code] [varchar](2) NULL,
[Feature Class] varchar(50) NULL,
[Legal/Statistical Area Code] [varchar](2) NULL,
[MAF/TIGER Feature Class Code] [varchar](5) NULL,
[Functional Status Code] [varchar](1) NULL,
[Current MSA Principal City Indicator] bit NOT NULL,
[Current NECTA Principal City Indicator] bit NOT NULL,
[Edition Date] [char](4) NULL,
[Source] [varchar](20) NULL,
[Land Area (meters)] [bigint] NULL,
[Water Area (meters)] [bigint] NULL,
[GeogAreaInSqrMeters] AS ([GeogBoundary].[STArea]()) PERSISTED,
[GeogPoints] AS ([GeogBoundary].[STNumPoints]()),
[GeogType] AS ([GeogBoundary].[STGeometryType]()),
[GeogIsValid] AS ([GeogBoundary].[STIsValid]()) PERSISTED,
[GeogBoundary] [geography] NOT NULL,
CONSTRAINT [PK_PlaceGeography] PRIMARY KEY CLUSTERED ( [Feature ID] ASC)
ON [Secondary] TEXTIMAGE_ON [Secondary]
Your time is precious. Take some extra time to decipher and use the Command Line option of Shape2SQL utility. It’s tricky at first, but once you get the hang of it, you will save countless hours and hours of time loading Shapefiles.