How Far Away is Artificial Intelligence from Replacing Product Managers?

In Lenny's Newsletter, Mike Taylor examines how far artificial intelligence (AI) has come in product management and uses practical case studies to assess whether AI can replace human product managers. Collaborating with Lenny, he applies prompt engineering to test AI on difficult product manager tasks. Through blind assessments run as polls on X (formerly Twitter), they found AI quite competitive on three specific tasks: formulating product strategy, defining performance metrics, and estimating the return on investment (ROI) of feature ideas. In two of the three tasks, AI-generated answers were rated better than human answers; in defining performance metrics in particular, the AI answer won a clear majority of votes. Taylor also shares how he optimizes AI output through prompt engineering and his views on where AI in product management is headed. He believes that as the technology advances, AI may soon become a powerful assistant to product managers and may even replace humans in certain areas.

I. Research Background and Methodology
Challenges in Assessing AI Capabilities
The AI industry is developing rapidly, but existing abstract benchmarks say little about how much real work AI can actually replace. Expert prompt engineering is crucial for getting good responses from a model, so most people may underestimate how much of their work tools like ChatGPT can already take over.

Testing Method
To measure in practice how much of a product manager's work AI models can take over, Lenny collaborates with professional prompt engineer Mike Taylor. They collect product manager tasks that AI appears to struggle with, run them through the current best models, and apply prompt engineering principles. AI and human answers are then compared through blind voting on X (formerly Twitter), without revealing which answer is AI-generated.
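To make the blind-voting protocol concrete, here is a minimal sketch of how such a comparison could be randomized before posting. The labels, data structure, and tallying details are assumptions; the article itself simply ran polls on X.

```python
import random

def make_blind_pair(human_answer: str, ai_answer: str) -> dict:
    """Randomly assign the two answers to the labels A and B so voters
    cannot infer which one is AI from position alone."""
    answers = [("human", human_answer), ("ai", ai_answer)]
    random.shuffle(answers)
    return {
        "A": answers[0][1],
        "B": answers[1][1],
        # Hidden answer key, revealed only after voting closes.
        "key": {label: source for label, (source, _) in zip("AB", answers)},
    }

pair = make_blind_pair("Human PM's answer...", "AI-generated answer...")
print("Plan A:", pair["A"])
print("Plan B:", pair["B"])
```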

II. Selection of Difficult Tasks and Testing Process
Sources of Difficult Tasks
On social media, Lenny collects examples of product manager tasks that ChatGPT has failed at, then selects three with significant impact on day-to-day work: formulating product strategy, defining key performance indicators (KPIs), and estimating the ROI of feature ideas.

Testing Process
For each task, the author writes prompts modeled on how strong human answers are structured, generates AI answers in the OpenAI Playground, and compares them with human answers from the Exponent website, which many product managers use for interview preparation. Users on X then vote for the better answer without being told which one is AI-generated.
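For readers who want to reproduce this step programmatically rather than in the Playground, below is a minimal sketch using the OpenAI Python SDK. The model name, prompt wording, and temperature are illustrative assumptions, not the exact values used in the article.

```python
# Minimal sketch: fetching a candidate AI answer via the OpenAI API.
# Model, prompt, and temperature are illustrative assumptions; the
# article used the OpenAI Playground with its own prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "You are a senior product manager interviewing at Google. "
    "Outline a one-year product strategy for YouTube Music. "
    "State your assumptions explicitly before the strategy itself."
)

response = client.chat.completions.create(
    model="gpt-4o",   # assumed; substitute whatever "current best model" you have
    temperature=0.7,  # assumed setting
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```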

III. Test Results and Analysis
Task One: Formulating Product Strategy
Test Scenario: Formulate the strategy for YouTube Music for the next year.
Results: The AI answer (Plan B) won with 55% of votes (counting ties), even though 77% of voters correctly guessed it was AI-generated. The human answer came from an interview response on Exponent's website. The main criticism of the AI answer was that it read like a feature list rather than a real strategy, though this may improve as models' reasoning capabilities develop.
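The post reports the 55% figure as "counting ties" but does not give its exact formula; one common convention is to split tie votes evenly between the two options, as in this sketch with hypothetical vote counts:

```python
def win_rate_with_ties(votes_for: int, votes_against: int, ties: int) -> float:
    """One plausible convention: split tie votes evenly between both sides.
    The article does not specify its formula, so this is an assumption."""
    total = votes_for + votes_against + ties
    return (votes_for + ties / 2) / total

# Hypothetical counts that would yield the reported 55%:
print(win_rate_with_ties(votes_for=50, votes_against=40, ties=10))  # 0.55
```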

Task Two: Defining Key Performance Indicators
Test Scenario: Determine the most important metrics for DoorDash.
Results: The AI answer (Plan A) won with 68% of votes, and 70% of voters correctly guessed it was AI. The AI gave a more comprehensive answer; even with measures taken to curb verbosity, its wordiness remained quite apparent. This suggests that for tasks like this, a more detailed AI answer can sometimes beat a human one.
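Curbing verbosity is usually done with explicit constraints in the prompt itself. The wording below is an assumption for illustration; the article does not publish its exact prompt.

```python
# Illustrative length constraint appended to a KPI prompt to curb verbosity.
# This wording is an assumption, not the article's actual prompt.
verbosity_guard = (
    "Answer in at most 150 words. Name one primary metric and at most "
    "three supporting metrics, each with a one-sentence justification. "
    "Do not add caveats or restate the question."
)

kpi_prompt = (
    "You are a product manager at DoorDash. Identify the single most "
    "important metric for the business and up to three supporting metrics.\n"
    + verbosity_guard
)
```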

Task Three: Estimating Return on Investment
Test Scenario: Measure the success of Meta's upcoming new Jobs feature in the short and long term.
Results: The human answer (Plan A) won with 58% of votes in a close race; many voters said they chose A for only minor reasons, and just 65% correctly guessed which answer was AI. Had the AI answer included numerical estimates, it might have won.
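Since voters rewarded numerical estimates, here is a back-of-envelope sketch of the kind of short- versus long-term estimate an answer could include. Every number below is an invented placeholder, not data from the article or from Meta.

```python
# Back-of-envelope ROI sketch for a hypothetical Jobs feature.
# ALL numbers are invented placeholders for illustration only.
monthly_active_users = 10_000_000  # assumed reachable audience
adoption_rate_short = 0.02         # assumed 3-month adoption
adoption_rate_long = 0.10          # assumed 2-year adoption
revenue_per_user = 1.50            # assumed monthly revenue per adopter
build_cost = 5_000_000             # assumed one-off development cost

short_term_revenue = monthly_active_users * adoption_rate_short * revenue_per_user * 3
long_term_revenue = monthly_active_users * adoption_rate_long * revenue_per_user * 24

print(f"3-month revenue: ${short_term_revenue:,.0f}")
print(f"2-year revenue:  ${long_term_revenue:,.0f}")
print(f"2-year ROI:      {(long_term_revenue - build_cost) / build_cost:.1%}")
```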

IV. Conclusions and Future Directions
Overall Conclusion
AI answers won two of the three tasks, showing real capability on certain product manager tasks, but AI still cannot serve as a product manager on its own. It is also worth remembering that AI capabilities will keep improving.

Future Directions
These benchmarks should be expanded to more types of difficult product manager tasks, and the performance of other models (such as Claude 3.5 and Google Gemini 1.5) should be explored, along with the effect of internet access on model capability. Data contamination also needs to be addressed, and the voting mechanism improved, to make the assessments more accurate and reliable.

Prompt Engineering Techniques
The author walks through the process and techniques behind the prompts: finding examples of how humans complete the task, distilling them into an explicit structure, asking the AI to role-play, having it list its assumptions, and so on. These techniques meaningfully improve AI output.
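Taken together, these techniques translate into a reusable prompt skeleton. The sketch below combines role-play, a structure borrowed from strong human examples, and an explicit assumptions step; the specific wording is an assumption, not the article's published prompt.

```python
# Reusable prompt skeleton combining the techniques above.
# The wording is illustrative; the article's actual prompts are not reproduced here.
PROMPT_TEMPLATE = """You are {role}, answering a product management interview question.

Question: {question}

First, list the assumptions you are making.
Then answer using this structure, modeled on strong human answers:
1. Context and goal
2. Key options considered
3. Recommendation and rationale
4. Metrics to validate the decision
"""

prompt = PROMPT_TEMPLATE.format(
    role="a senior product manager at YouTube",
    question="What should YouTube Music's strategy be for the next year?",
)
print(prompt)
```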