Commit Graph

1 Commits

  • Python: Reduce flaky integration tests and improve CI signal quality (#5454)
    * Enable Ollama integration tests in CI and rename report to Integration Test Report
    
    - Install Ollama, cache models (qwen2.5:0.5b + nomic-embed-text), and start
      server in the Misc integration job for both workflow files
    - Set OLLAMA_MODEL and OLLAMA_EMBEDDING_MODEL env vars so the 5 Ollama tests
      are no longer skipped
    - Rename Flaky Test Report to Integration Test Report throughout (job names,
      artifact names, cache keys, file names, script titles/docstrings)
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Bump Ollama model to qwen2.5:1.5b for better instruction following
    
    The 0.5b model was too small to reliably follow simple prompts like
    'Say Hello World', causing test assertion failures. The 1.5b model
    follows instructions more reliably while still being small enough
    for fast CI pulls (~1GB).
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Re-enable reliable streaming integration tests
    
    Remove the hard skip on test_03_reliable_streaming tests that was
    temporarily disabled for instability investigation. CI infrastructure
    (Azurite, DTS emulator, Redis, func CLI) is already in place.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Re-enable skipped Functions/DurableTask tests and bump timeout to 480s
    
    - Remove hard skips from 4 tests in test_11_workflow_parallel.py
    - Remove hard skip from test_conditional_branching in test_06_dt_multi_agent_orchestration_conditionals.py
    - Increase pytest --timeout from 360 to 480 for Functions+DurableTask CI job
    - Updated in both python-merge-tests.yml and python-integration-tests.yml
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Re-skip failing Functions/DurableTask tests with specific root causes
    
    - test_11_workflow_parallel (4 tests): xdist worker crashes during execution
    - test_conditional_branching: orchestration fails with RuntimeError, not a timeout
    - Keep 480s timeout bump for remaining Functions tests
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fix auth routing in samples 06/11: api_key -> credential for Azure OpenAI
    
    Both samples passed a bearer token provider via api_key= which caused the
    client to route to api.openai.com instead of Azure OpenAI, resulting in
    401 Unauthorized. Changed to credential= which correctly triggers Azure
    routing and picks up AZURE_OPENAI_ENDPOINT from the environment.
    
    - samples/azure_functions/11_workflow_parallel/function_app.py: 1 fix
    - samples/durabletask/06_multi_agent_orchestration_conditionals/worker.py: 2 fixes
    - Re-enable 4 parallel workflow tests and 1 conditional branching test
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Re-skip parallel workflow tests: xdist worker distribution issue
    
    The 4 parallel workflow tests crash because xdist worksteal distributes
    them across separate workers, each spawning its own func process against
    shared emulators. Auth fix (api_key->credential) was valid and stays.
    test_conditional_branching now passes with the auth fix.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fix E501 line-too-long in azurefunctions parallel test skip reasons
    
    Wrap skip reason strings to stay within 120 char line limit.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Add retry logic and port-conflict fix for Ollama CI setup
    
    - Kill any auto-started Ollama before launching serve (fixes port
      conflict: 'address already in use')
    - Retry ollama pull up to 3 times with 15s backoff (fixes 429 rate
      limit failures)
    - Applied to both python-merge-tests.yml and python-integration-tests.yml
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fix flaky integration tests and re-enable skipped tests
    
    - Foundry agent: add allow_preview=True to custom client test
    - Foundry hosting: raise max_output_tokens 50->200, add temperature,
      relax assertion in test_temperature_and_max_tokens
    - Foundry embedding: update skip reason with root cause (endpoint mismatch)
    - OpenAI file search: fix vector store indexing race condition by polling
      file_counts before querying; fix get_streaming_response -> get_response(stream=True)
    - Azure OpenAI file search: remove skip (transient 500 resolved)
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Remove temperature from foundry hosting test (unsupported by CI model)
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Stabilize Ollama tool call integration tests with no-arg function
    
    Use a no-argument greet() function instead of hello_world(arg1) for
    integration tests. The 1.5B model in CI is unreliable at generating
    correct tool call arguments, causing 'Argument parsing failed' errors.
    A no-arg function eliminates this flakiness entirely.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Increase reliable streaming test timeouts from 30s to 60s
    
    The LLM call through Azure OpenAI + Redis streaming pipeline can exceed
    30s in CI due to cold starts or throttling. Raise to 60s to reduce
    flaky timeouts while still bounded by pytest's 120s per-test limit.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Re-enable workflow parallel tests with xdist_group marker
    
    The tests were skipped because xdist distributes module tests across
    workers, each spawning their own func process (port conflicts). Adding
    xdist_group forces all tests in this module onto a single worker so
    the module-scoped function_app_for_test fixture works correctly.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Revert "Re-enable workflow parallel tests with xdist_group marker"
    
    This reverts commit 455c28da62.
    
    * Rename flaky_report to integration_test_report and add try/finally cleanup
    
    - Rename scripts/flaky_report/ to scripts/integration_test_report/ to
      reflect expanded scope beyond flaky-test detection
    - Update workflow references in both CI files
    - Wrap file search integration tests in try/finally to ensure vector
      store cleanup runs even on test failure or timeout
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fix Ollama pull failure propagation and Azure OpenAI vector store readiness
    
    - Ollama CI: fail the step immediately if model pull fails after 3
      retries instead of silently proceeding to tests
    - Azure OpenAI file search: add the same vector-store readiness polling
      that was applied to the non-Azure OpenAI tests, preventing eventual
      consistency race conditions
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * remove load_dotenv from test file
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>