Python Generators to the Rescue


For one of my projects, I had a Python script generate some GeoJSON data from database records.

Under the original load, the script worked fine, but with an expanded dataset, it became a memory hog and couldn't run to completion on our servers.

Fortunately, I could use a generator function to alleviate the memory issues.

What Are Python Generators?

Traditional functions return their results only once execution is complete. A generator function yields results incrementally as it executes.

You can iterate over the return value of a generator function, processing results as they are produced, rather than waiting for the complete result list to be calculated before iterating.
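To make this concrete, here's a toy comparison (my illustration, not code from the original project):

def squares_list(n):
    # Builds the entire result list in memory before returning
    return [i * i for i in range(n)]

def squares_generator(n):
    # Yields one value at a time; nothing is accumulated
    for i in range(n):
        yield i * i

# Both can be iterated the same way, but the generator never
# materializes the full sequence in memory
for square in squares_generator(5):
    print(square)  # prints 0, 1, 4, 9, 16

Calling squares_generator(5) returns a generator object immediately; the loop body runs only as each value is requested.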

The Original Approach: A Memory-Intensive Function

The initial version of the GeoJSON-calculating function returned a complete GeoJSON feature collection with thousands of features as a Python dictionary containing a list of features.

That Python dictionary was then converted to JSON and written to a file. With large data volumes, the JSON became too large for the available memory.

Here's a simplified example:

import json

def process_large_dataset(db_connection):
    # Loads entire dataset into memory at once
    all_features = []

    # Query database for records
    cursor = db_connection.cursor()
    cursor.execute("SELECT id, latitude, longitude, name, type FROM locations")

    # Process each record into a GeoJSON feature
    for record in cursor.fetchall():
        feature = {
            'type': 'Feature',
            'geometry': {
                'type': 'Point',
                'coordinates': [record[2], record[1]]  # GeoJSON order is [longitude, latitude]
            },
            'properties': {
                'id': record[0],
                'name': record[3],
                'location_type': record[4]
            }
        }
        all_features.append(feature)

    # Return complete GeoJSON structure
    return {
        'type': 'FeatureCollection',
        'features': all_features
    }

# Usage - memory-intensive approach
geojson_data = process_large_dataset(db_connection)

# Write to file
with open('locations.geojson', 'w') as f:
    json.dump(geojson_data, f)

The Solution: Generator-Based Processing

To fix this, I converted the GeoJSON function into a generator.

The function looped through the relevant database records and generated a GeoJSON feature for each record. Instead of appending that feature to a list, it yielded the individual feature.

Here's a simplified example of the new function:

def generate_geojson_features(db_connection):
    """Generator that yields one GeoJSON feature at a time"""

    # Use a server-side (named) cursor so rows stream from the database
    # instead of being fetched all at once (named cursors are psycopg2-specific)
    cursor = db_connection.cursor(name='server_side_cursor')
    cursor.execute("SELECT id, latitude, longitude, name, type FROM locations")

    # Fetch and yield one record at a time
    try:
        for record in cursor:
            yield {
                'type': 'Feature',
                'geometry': {
                    'type': 'Point',
                    'coordinates': [record[2], record[1]]  # GeoJSON order is [longitude, latitude]
                },
                'properties': {
                    'id': record[0],
                    'name': record[3],
                    'location_type': record[4]
                }
            }
    finally:
        # Close the cursor even if the consumer stops iterating early
        cursor.close()

# Usage with batching to maintain valid GeoJSON structure
def write_geojson_in_batches(db_connection, filename, batch_size=1000):
    """Write GeoJSON file using batched processing to limit memory usage"""

    with open(filename, 'w') as f:
        # Write GeoJSON opening
        f.write('{"type": "FeatureCollection", "features": [\n')

        feature_generator = generate_geojson_features(db_connection)
        first_feature = True

        while True:
            # Process a batch of features
            batch = []
            try:
                for _ in range(batch_size):
                    batch.append(next(feature_generator))
            except StopIteration:
                # No more features to process
                pass

            # If batch is empty, we're done
            if not batch:
                break

            # Write batch to file with proper JSON formatting
            for feature in batch:
                if first_feature:
                    first_feature = False
                else:
                    f.write(',\n')  # Add comma between features

                # Write the individual feature as JSON
                f.write(json.dumps(feature))

            # Clear batch from memory
            batch = None

        # Close the GeoJSON structure
        f.write('\n]}')

# Usage
write_geojson_in_batches(db_connection, 'locations.geojson', batch_size=1000)

Practical Memory Management

The script calling the GeoJSON function would collect a batch of the yielded GeoJSON features and write the batch to a file. Thus, the number of features in memory never exceeded the batch size.

Note that the code to write the GeoJSON file became slightly more complex; however, the reduced memory usage was more than worth it.
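As an aside, if you'd rather not manage next() and StopIteration by hand, itertools.islice can pull fixed-size batches from any generator. Here's a minimal sketch of that variation (my simplification, not the project's actual code):

from itertools import islice

def iter_batches(source, batch_size):
    # Repeatedly slice off up to batch_size items; islice yields
    # nothing once the source iterator is exhausted
    while True:
        batch = list(islice(source, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage with the feature generator above
for batch in iter_batches(generate_geojson_features(db_connection), 1000):
    print(f"got a batch of {len(batch)} features")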

Generators for the Win

After this simple optimization, the script could run on our servers without eating all the available memory.

This approach is useful anytime you are working with large data volumes, such as scraping 10,000 web pages and feeding their content to an LLM, as sketched below.
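Here's a hedged sketch of that scraping case (the URL list and send_to_llm helper are hypothetical placeholders):

import requests

def fetch_pages(urls):
    # Download and yield one page body at a time, so only a single
    # response is held in memory regardless of how many URLs there are
    for url in urls:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        yield response.text

# Hypothetical usage
# for page_text in fetch_pages(url_list):
#     send_to_llm(page_text)  # placeholder for your LLM pipeline

Happy generating!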

Originally published on 2025-04-23 by Ben Chopson
