As TuSimple develops industry-leading autonomous trucking technology, safety is our top priority. Autonomous vehicles have the potential to be one of the biggest improvements in transportation safety ever, and they require safe operations and safe technology development. Exhaustively finding and fixing potential safety issues is a critical part of building this technology, which has the potential to save thousands of lives every year. To do this, TuSimple deploys layers of verification and safety measures to identify, capture, and track issues through safe resolution.
The goal is to catch all safety-related items at the earliest point possible in the design, verification, and testing process. For those interested in learning what it takes to build safety into complex autonomous driving systems, this article gives a glimpse into our cutting-edge safety best practices.
Safety begins with identification of possible weaknesses that need analysis and resolution. This is a multifaceted process that involves the entire development team.
Identifying potential issues and bringing them to the forefront is the responsibility of every TuSimple employee. Specifically:
Every employee is encouraged to report, and is responsible for reporting, gaps or problems in our ticketing system.
We use a combination of tools to root out engineering problems, because each tool provides a different lens with its own level of coverage and detail. Combined, these tools give TuSimple a multifaceted view of the system’s current capabilities.
Detections occur at different points of development, verification, and operations, and all findings become issues tracked and prioritized in our ticketing system.
Our Systems & Safety Engineering (S&SE) experts use a suite of industry standard tools to methodically shake down the design and close gaps. TuSimple uses the following system level detection methods, among others:
Provides top-down analysis of the interactions between subsystems, functions, and real-world actors. New findings are added as tickets, and when appropriate spawn their own follow-up projects.
Identifies high-risk / low-controllability failures. These items get enhanced focus in downstream testing, such as further testing in our Systems Integration Lab.
Proactively identifies potential hazards, assesses their impacts, and tracks, communicates, and drives mitigations to a safe resolution.
Engineering Design Reviews and Cross-Functional Deep Dives
Ensures requirements are understood and implemented correctly, and that design changes made for a focused area are well integrated into the whole.
Daily Findings Review
Used to review recent findings from Simulation, Regeneration, Systems Integration Lab, track testing, and road testing, to determine the severity, criticality, and likelihood of discovered issues. This review consists of experts in these disciplines, as well as managers, directors, and executives.
Our Independent Verification & Validation (IV&V) experts put new software and firmware through a gauntlet of testing before allowing road testing.
There are many ways that TuSimple provides ongoing, real-time monitoring of system health.
|ADS Health Manager||The ADS Health Manager monitors hardware, compute functions, timing, rationality checks, and warning and error thresholds. Errors are logged during every mission, and the Triage team follows up by reviewing them and ensuring each is associated with a new or existing ticket. The Health Manager is a critical continuous monitoring capability that can proactively trigger the truck to pull over when necessary. Health Manager warnings, detected errors, and outcomes are all logged and analyzed post-mission, and findings are added to tracked tickets.|
|Safety Drivers and Technicians||The Safety Drivers and Technicians monitor operation of the ADS before, during, and after an autonomous mission. The technician annotates the data each time they witness any problem, which automatically generates a new ticket for the Triage team to review.|
|Compute System||The Compute Systems (main compute units, vehicle control units) continuously monitor each other and associated hardware subsystems for sanity and proper operation, both triggering appropriate responses and logging issues for the Triage team to review.|
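As a rough illustration of threshold-based health monitoring (a minimal sketch, not TuSimple's actual Health Manager), the Python below classifies each reading against warning and error thresholds, logs anything abnormal, and requests a pull-over when any reading reaches the error level. The signal names, threshold values, and the rule that a single error triggers a pull-over are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    OK = 0
    WARNING = 1
    ERROR = 2


@dataclass(frozen=True)
class Threshold:
    warning: float  # hypothetical value at which a warning is logged
    error: float    # hypothetical value at which an error is logged


def classify(value: float, threshold: Threshold) -> Level:
    """Map one reading to a level; higher readings are assumed to be worse."""
    if value >= threshold.error:
        return Level.ERROR
    if value >= threshold.warning:
        return Level.WARNING
    return Level.OK


def evaluate(readings: dict, thresholds: dict):
    """Return (log_entries, pull_over) for one monitoring cycle."""
    log = []
    pull_over = False
    for name, value in readings.items():
        level = classify(value, thresholds[name])
        if level is not Level.OK:
            log.append((name, value, level))  # kept for post-mission triage
        if level is Level.ERROR:
            pull_over = True  # assumed policy: any error requests a pull-over
    return log, pull_over


# Hypothetical signals and limits, chosen only for illustration.
thresholds = {
    "cpu_temp_c": Threshold(warning=80.0, error=95.0),
    "loop_latency_ms": Threshold(warning=50.0, error=100.0),
}
log, pull_over = evaluate({"cpu_temp_c": 85.0, "loop_latency_ms": 20.0}, thresholds)
```

In this sketch the log entries stand in for the records that would later be reviewed and attached to a ticket.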
Whenever we have a finding during a mission, TuSimple may, as appropriate, perform the following types of analysis:
|Triage Analysis||Where a team of experts reviews all available data to localize the problem and group it with known issues. Our Triage engineers are experts at reviewing mission data and helping the organization focus on critical or recurring issues.|
|Mission Data to Regeneration||Where we rerun the recorded mission data from an event through the simulation environment to see what would have happened under slightly different circumstances. For example, what if the Safety Driver hadn’t taken over? Would the ADS have reacted in a healthy way? This is also known in the industry as “counterfactual analysis”, meaning, what would have happened if the facts had been slightly different? Regeneration efforts often lead to generating new findings tickets.|
|Mission Data to Simulations||Where we create simulations similar to events seen on a mission. These simulations have purely simulated sensor data, and can be manipulated to make the simulated speeds, timing, and direction slightly worse or better to understand the exact failure point. These simulations often lead to creating new findings tickets. In many cases, the simulations we create to study a specific mission event become part of our ongoing library of regression testing for new versions of the software.|
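The idea of nudging a simulated scenario until it fails can be sketched as a one-dimensional search. The Python below is a toy example under stated assumptions: `stops_in_time` is a hypothetical stand-in for a full simulation run, using a simple reaction-time-plus-braking model, and `find_failure_point` bisects one parameter (here, speed) to bracket the boundary between passing and failing. It assumes the outcome flips only once across the searched range, which a real scenario may not guarantee.

```python
def stops_in_time(speed_mps, gap_m, reaction_s=0.5, decel_mps2=4.0):
    """Toy pass/fail check: can the vehicle stop within the available gap?"""
    stopping_distance = speed_mps * reaction_s + speed_mps ** 2 / (2 * decel_mps2)
    return stopping_distance <= gap_m


def find_failure_point(passes, low, high, tolerance=0.01):
    """Bisect for the boundary between passing and failing parameter values.

    Assumes passes(low) is True, passes(high) is False, and the outcome
    flips only once between them.
    """
    if not passes(low) or passes(high):
        raise ValueError("low must pass and high must fail")
    while high - low > tolerance:
        mid = (low + high) / 2
        if passes(mid):
            low = mid   # still safe: move the passing bound up
        else:
            high = mid  # failed: move the failing bound down
    return low, high


# Hypothetical scenario: a stopped obstacle 60 m ahead.
last_pass, first_fail = find_failure_point(
    lambda speed: stops_in_time(speed, gap_m=60.0), low=0.0, high=40.0
)
```

With these toy numbers the boundary sits at 20 m/s, so the returned pair brackets that speed. A scenario pinned down this way could then be stored as a regression case.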
When potential safety risks are identified, it’s important to rigorously document and track them to make sure a solution is implemented and verified. Less formal processes could allow issues to fall through the cracks. We use Jira, a tool used throughout safety-critical industries, as our primary ticketing system. Jira tickets may be associated with the formally recorded results of any engineering activity, such as design work, testing results, and other related issues.
Failures of significant consequence are added to our Corrective Action Preventive Action (CAPA) tracking and resolution system. This process is managed by our Quality Assurance team as one element of our Quality Management System (QMS).
As is typical for safety-critical industries, items loaded into the CAPA system are registered for tracking and then evaluated for one or more potential Corrective Actions that immediately correct the situation. Any potential Preventive Actions, those that would have kept the problem from happening in the first place, are evaluated to improve our engineering and review processes. For example, fixing a broken component would be a corrective action, and upgrading the part quality for future builds would be a possible preventive action.
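The distinction between corrective and preventive actions can be made concrete with a small data model. The Python below is a hypothetical sketch, not the schema of any real CAPA or QMS tool: the class names, fields, and the closing rule (at least one corrective action, and every action verified) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionKind(Enum):
    CORRECTIVE = "corrective"  # corrects the situation that occurred
    PREVENTIVE = "preventive"  # would have kept it from happening at all


@dataclass
class Action:
    kind: ActionKind
    description: str
    verified: bool = False


@dataclass
class CapaItem:
    title: str
    actions: list = field(default_factory=list)

    def can_close(self) -> bool:
        """Assumed rule: needs a corrective action, and all actions verified."""
        has_corrective = any(a.kind is ActionKind.CORRECTIVE for a in self.actions)
        return has_corrective and all(a.verified for a in self.actions)


# Mirrors the example in the text: repair now, upgrade the part for future builds.
item = CapaItem(
    title="Broken component",
    actions=[
        Action(ActionKind.CORRECTIVE, "Replace the broken component", verified=True),
        Action(ActionKind.PREVENTIVE, "Upgrade part quality for future builds"),
    ],
)
closable_before = item.can_close()  # preventive action not yet verified
item.actions[1].verified = True
closable_after = item.can_close()
```

Keeping both kinds of action on the same record means the item cannot be closed on the immediate repair alone.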
When issues are cross-functional, complex, and span multiple disciplines, we may assign an issue or a group of related issues to a named project. Named projects are typically overseen by a Senior Manager or Director level leader, managed by a Program Manager or Technical Program Manager, and are formally managed into the workload as priority activities.
As in other safety-critical industries, including aerospace, medical, and power generation, autonomous technology is complex and requires layers of safety processes to identify and resolve safety risks. We deploy industry-standard analysis, automated tools, and exhaustive simulation to detect and correct risks as early in the development process as possible. These activities are company-wide, and we rely on the strength of our whole organization to deliver safety.
Our goal is to improve transportation safety for all road users by delivering the safest driver on the road. Ultimately, through safe engineering, testing, and deployment, we have the opportunity to save thousands of lives every year.