# XPath Injection via lxml XPath Predicates

Language: Python
Severity: High
CWE: CWE-643

## Source
8,9

## Flow
8-9-11

## Sink
11

## Vulnerable Code
```python
from lxml import etree
import flask

app = flask.Flask(__name__)

@app.route('/iot/device_telemetry')
def fetch_device_metrics():
    device_id = flask.request.args.get('device_id', '')
    sensor_type = flask.request.args.get('sensor', 'temperature')
    xml_data = etree.parse('telemetry_data.xml')
    query = f"//device[@id='{device_id}']/sensors/sensor[@type='{sensor_type}']/reading"
    results = xml_data.xpath(query)
    metrics = [etree.tostring(r, encoding='unicode') for r in results]
    return flask.jsonify({'device': device_id, 'metrics': metrics})
```

## Explanation

The application constructs an XPath query by directly concatenating unsanitized user inputs (device_id and sensor_type) into the query string. An attacker can break out of the XPath predicate context and inject arbitrary XPath expressions to extract unauthorized data or bypass access controls.

## Remediation

The fix uses lxml's built-in XPath parameterization (XPath variables via keyword arguments to the xpath() method) which safely handles user input without string concatenation. Additionally, an input validation function using a whitelist regex ensures that only alphanumeric identifiers (with hyphens and underscores) are accepted, providing defense-in-depth against injection attacks.

## Secure Code
```python
from lxml import etree
import flask
import re

app = flask.Flask(__name__)

def validate_identifier(value):
    """Validate that the input is a safe alphanumeric identifier."""
    if not re.match(r'^[a-zA-Z0-9_\-]+$', value):
        return None
    return value

@app.route('/iot/device_telemetry')
def fetch_device_metrics():
    device_id = flask.request.args.get('device_id', '')
    sensor_type = flask.request.args.get('sensor', 'temperature')
    
    # Validate inputs to prevent XPath injection
    device_id = validate_identifier(device_id)
    sensor_type = validate_identifier(sensor_type)
    
    if device_id is None or sensor_type is None:
        return flask.jsonify({'error': 'Invalid input parameters'}), 400
    
    xml_data = etree.parse('telemetry_data.xml')
    
    # Use parameterized XPath with XPath variables to prevent injection
    query = "//device[@id=$device_id]/sensors/sensor[@type=$sensor_type]/reading"
    results = xml_data.xpath(query, device_id=device_id, sensor_type=sensor_type)
    
    metrics = [etree.tostring(r, encoding='unicode') for r in results]
    return flask.jsonify({'device': device_id, 'metrics': metrics})
```
