Rust Hiring for Yatima Inc.

Tue, Dec 06, 2022


  • A Y Combinator-backed research startup was expanding and needed to hire several developers.
  • They needed to make sure that their candidates (10 per open position) would fit well skill-wise.
  • ∅hr offered them a Rust test task, which they inspected and judged fit for purpose.
  • ∅hr processed candidate submissions:
    • Either suggesting the candidate attempt another submission because the first one wasn't satisfactory,
    • Or accepting the submission and providing the research startup with feedback:
      • On the proficiency level (junior, middle, or senior).
      • On soft skills (diligence, specification comprehension, attention to detail).
  • The research startup ended up making successful hires who exceeded expectations.

The challenge of scaling the team🔗

Yatima Inc. approached us as they were wrapping up their C round with Y Combinator. Now that they had the capital, they were expanding the team to speed up the programming language theory research they were conducting. Since startup budgets are very much finite, they couldn't afford to hire the wrong employees, who wouldn't deliver at the required pace. The companies signed a mutual non-disclosure agreement, and ∅hr sent out a Rust task specification for Yatima Inc. to evaluate whether it was fit for purpose. After they did, the companies started their collaboration.

When asked why they trusted ∅hr's signal about whom to hire, Yatima Inc.'s CEO answered:

The trust here is not whether or not you trust someone else to make a hiring decision for you, the trust here is whether or not this information puts you in a better position to make a good hiring decision. I looked pretty carefully at the coding task, evaluation process, materials around that like instructions. And I was able to, within an hour, verify myself that this is at least as good if not better than what I would come up with after 20-30 hours. And I wouldn't make any out of automation stuff.

Feedback about the task🔗

While surveying the reasons Yatima Inc. was happy with the test task, the following key points emerged:

  • The packaging of the task.
    • PDF specification & git repository where everything is clearly labeled, which makes it easy to:
      • Send the package to the candidates.
      • Import it into the candidate's preferred IDE.
      • Fill in the implementation that matters for the evaluation.
  • Automated testing.
    • After the candidate sends back the submission, ∅hr runs it through the tester.
    • There is an overall dashboard to see which candidates performed well overall.
    • As well as a clearly expressed set of outputs saying:
      • "This candidate has passed these tests, this candidate didn't pass these tests".
      • The tests were also named in an evocative manner.
  • The structure of the tests.
    • Tests have a lot of corner cases.
    • They test for nuances in the specification.
    • ∅hr sends tons and tons of various test inputs and sees what outputs the filled-in function produces.
    • This reflects a programmer's work in an industrial setting perfectly.
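The "tons and tons of test inputs" approach amounts to differential testing: run the candidate's function and a reference implementation over many generated inputs and compare the outputs. Here is a minimal, self-contained sketch of the idea; the function name `cluster_count`, the input format, and the toy PRNG are all illustrative assumptions, not the real task.

```rust
// Hypothetical sketch of differential testing: the task, the function name,
// and the input format are made up for illustration.

fn reference_cluster_count(input: &[u32]) -> usize {
    // Toy "reference" behavior: count runs of equal values.
    let mut count = 0;
    let mut prev = None;
    for &x in input {
        if Some(x) != prev {
            count += 1;
            prev = Some(x);
        }
    }
    count
}

fn candidate_cluster_count(input: &[u32]) -> usize {
    // Stand-in for the candidate's filled-in implementation.
    reference_cluster_count(input)
}

fn main() {
    // Generate many varied inputs with a tiny deterministic xorshift PRNG,
    // so the sketch needs no external crates and reruns are reproducible.
    let mut seed: u64 = 0x9E3779B97F4A7C15;
    let mut rand = || {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        seed
    };
    let mut failures = 0;
    for len in 0..200 {
        let input: Vec<u32> = (0..len).map(|_| (rand() % 5) as u32).collect();
        if candidate_cluster_count(&input) != reference_cluster_count(&input) {
            failures += 1;
        }
    }
    println!("failures: {failures}");
}
```

In a real harness the candidate's binary would be invoked as a separate process, but the comparison loop is the same.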

Candidate Ranking Pipeline🔗

Step 1: Send out the Task🔗

Task specifications and the solution skeleton were sent out to the candidates by the customer, and questions from the candidates were answered by ∅hr. The customer also had a set of standard, frequently asked questions about the task (provided by ∅hr), which helped them answer candidates' questions by themselves.

(Images: Hanooy Maps Rust test task example; Hanooy Maps git repository snapshot)

In the starter E-Mail, the customer asked the candidates to send back a ZIP archive with their submissions so that checking could commence.

Step 2: The Candidate Solves the Task🔗

To solve the task, the candidate had to be proficient with development tools. They had to extract the task skeleton and import the project into their IDE.

Then they were expected to quickly produce a simple solution and test that it works. Afterwards, they were expected to improve the simple solution to perform significantly better than naive ones.

After the task was solved, the candidate had to create an archive out of their project directory and send it back for evaluation.

Step 3: Evaluation🔗

∅hr ranked the submissions automatically, using continuous and discrete scoring.

Continuous Scoring🔗

All of the challenges are designed to provide a continuous measure of technical proficiency. This means a junior developer might be expected to score 1,000 points, whereas a senior could achieve up to 50,000. This is achieved by placing NP-complete problems at the core of the test tasks. We also made sure that smart heuristics and industry-inspired optimizations provide a significant score boost.
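To see why an NP-complete core yields a continuous score, consider a toy knapsack instance (not ∅hr's actual task): a naive solver and a simple value-density heuristic earn very different point totals, and smarter approaches keep climbing from there.

```rust
// Toy illustration of continuous scoring on an NP-complete core.
// The problem, item values, and scoring are assumptions for this sketch.

fn naive_score(items: &[(u32, u32)], capacity: u32) -> u32 {
    // Naive strategy: take items in the given order until the knapsack fills.
    let mut weight = 0;
    let mut value = 0;
    for &(w, v) in items {
        if weight + w <= capacity {
            weight += w;
            value += v;
        }
    }
    value
}

fn greedy_score(items: &[(u32, u32)], capacity: u32) -> u32 {
    // Heuristic: sort by value-per-weight first (cross-multiplied to stay
    // in integers), then take greedily. A small change, a large score boost.
    let mut sorted: Vec<(u32, u32)> = items.to_vec();
    sorted.sort_by(|a, b| (b.1 * a.0).cmp(&(a.1 * b.0)));
    naive_score(&sorted, capacity)
}

fn main() {
    // (weight, value) pairs; heavy low-value items are listed first,
    // so the naive order is deliberately bad.
    let items = [(10, 5), (10, 4), (2, 9), (2, 8), (2, 7)];
    let capacity = 12;
    println!("naive:  {}", naive_score(&items, capacity));  // 14
    println!("greedy: {}", greedy_score(&items, capacity)); // 24
}
```

An exact solver would score higher still, which is exactly the spread that separates junior from senior submissions.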

Discrete Evaluation Flags🔗

Apart from continuous performance scores that demonstrate the overall competence of an IT professional, specific markers were integrated into the challenges. These flags assess attributes like diligence, specification comprehension, and the ability to discern nuances in tasks. They are designed to provide a holistic assessment of a candidate's skills, beyond technical proficiency alone. This sort of evaluation has informed the method of testing the submissions: we allow each candidate up to two attempts. The first is expected to fail some of these discrete flags, whereas the second checks that the candidate is capable of fixing imprecisions.
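One way such flags could be derived is by mapping test outcomes onto booleans. The sketch below is an assumption about the shape of that mapping, not ∅hr's actual rubric; the flag names mirror the attributes listed above, while the thresholds are invented.

```rust
// Hedged sketch: deriving discrete evaluation flags from per-test outcomes.
// Field names and the derivation rules are illustrative assumptions.

#[derive(Debug, PartialEq)]
struct EvaluationFlags {
    diligence: bool,          // e.g. every test passes
    spec_comprehension: bool, // e.g. nuance-heavy tests pass
    attention_to_detail: bool, // e.g. no output-format violations
}

struct TestOutcome {
    name: &'static str,
    passed: bool,
    nuance_test: bool, // does this test target a specification nuance?
    format_ok: bool,   // did the output follow the required format?
}

fn derive_flags(outcomes: &[TestOutcome]) -> EvaluationFlags {
    EvaluationFlags {
        diligence: outcomes.iter().all(|t| t.passed),
        spec_comprehension: outcomes
            .iter()
            .filter(|t| t.nuance_test)
            .all(|t| t.passed),
        attention_to_detail: outcomes.iter().all(|t| t.format_ok),
    }
}

fn main() {
    let outcomes = [
        TestOutcome { name: "Sanity", passed: true, nuance_test: false, format_ok: true },
        TestOutcome { name: "Azcanta", passed: false, nuance_test: true, format_ok: true },
    ];
    let flags = derive_flags(&outcomes);
    println!("{} tests -> {flags:?}", outcomes.len());
}
```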

Running Submissions🔗

∅hr ran the submissions automatically, programmatically creating Docker containers and running them. It then collected the outputs and assigned scores, both continuous and discrete, to each invocation of the submissions. The results of the evaluation were considered final once the automatic system determined that they were statistically stable. It checked for this by normalizing the scores against the canonical submission (authored by ∅hr) and, in case of too large a deviation, rerunning the given submission until the results converged.
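The rerun-until-stable idea can be sketched as follows. The stability criterion (spread of canonical-normalized scores within a tolerance) and the threshold value are assumptions for illustration; the actual system's criterion is not specified in this post.

```rust
// Illustrative sketch: rerun a submission until its canonical-normalized
// scores converge. The tolerance and the in-memory "runs" are assumptions;
// a real system would launch a Docker container per run.

fn is_stable(scores: &[f64], canonical: f64, tolerance: f64) -> bool {
    // Need at least two runs before we can talk about stability.
    if scores.len() < 2 {
        return false;
    }
    // Normalize each raw score against the canonical submission's score.
    let normalized: Vec<f64> = scores.iter().map(|s| s / canonical).collect();
    let max = normalized.iter().cloned().fold(f64::MIN, f64::max);
    let min = normalized.iter().cloned().fold(f64::MAX, f64::min);
    // Stable when the spread of normalized scores is within tolerance.
    max - min <= tolerance
}

fn main() {
    let canonical = 1000.0;
    // Stand-in for container invocations returning raw scores.
    let mut runs = vec![980.0, 1005.0].into_iter();
    let mut scores = Vec::new();
    while !is_stable(&scores, canonical, 0.05) {
        scores.push(runs.next().expect("ran out of sample runs"));
    }
    println!("converged after {} runs", scores.len());
}
```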

Then a summary of each submission in plain English was generated and, for special cases, an inspection by one of ∅hr's task engineers was conducted in order to provide additional feedback.

The results were then sent to the customer via E-Mail and were also visible in the customer's dashboard, credentials to which were shared at the beginning of the collaboration.

Examples of Communication🔗

Here we provide example communication concerning two candidates. Candidate A ranked poorly during the preliminary screening call that the customer conducted because they didn't present themselves well and were uncertain when speaking about their skills. Candidate B, on the other hand, ranked highly during that call because they were assertive and held a senior position at their then-current workplace.

Feedback #1, Candidate A: Failing the Placement of Ziggurats🔗

We suggest the following response to the candidate:

Dear Candidate,

Sadly, the output you have generated fails primality checks in the basic tests: Azcanta, Catalcan, Metzali.

It normally means one of two things: either your coordinate calculation is incorrect or, perhaps, your layer compatibility function checks the sequences of layers in the wrong direction.

We encourage you to send a follow-up submission with the problem fixed. To help you test it, please use the reference inputs and outputs attached.

Response #1, Candidate A🔗

Fixed Rust solution attached. Sorry, I didn't read the requirements carefully enough!

Feedback #2, Candidate A: Passing the Test🔗

Great news!

The second submission of this candidate is correct, and its results are clearly distinguishable from those of the naive implementation. We think that this person can be interviewed for a junior-to-middle position.

We're suggesting a middle position because the candidate managed to implement the solver exactly to specification, catching all the corner cases, as can be seen in the Azcanta test, which costs many candidates ten points due to a nuanced input.

We also see that the program submitted doesn't choke on huge inputs.

Some candidates' programs show lower scores for Itlimoc (the huge map allowing for gigantic clusters) than on regular big and large maps like Aclazotz and Adanto.

Please remember that it's perfectly reasonable and expected that the candidate didn't manage to submit a perfect solution on the first try.

Attaching the score sheet from the dashboard for your convenience:

Test       Candidate     Canonical     Ratio
Sanity     27            27            100%
Azcanta    385           385           100%
Catalcan   5057          5057          100%
Metzali    9934          9934          100%
Aclazotz   250000        250000        100%
Adanto     231836        4204100       5.52%
Itlimoc    285747        2558972010    0.01%

Feedback #1, Candidate B: Failing to Find the Intentional Bugs🔗

We suggest the following response to the candidate:

It seems like you have failed to find the intentional bugs in the submission skeleton.

It's absolutely normal and is a part of the process.

We encourage you to send a follow-up submission with the problem fixed. To help you test it, please use the reference inputs and outputs attached.

Response #1, Candidate B🔗

I fixed two problems, which should probably take care of most of the errors. I also fixed some issues in the template, like it not being able to detect ziggurats that are of a single color.

My other assumptions are still the same.

Would be happy to work on the task more. As you pointed out, I have not ever coded in Rust and the syntax need(s/ed) some getting used to.

Feedback #2, Candidate B🔗

The submission has failed to produce output and was terminated by the task scheduler.

To ensure fairness and to rule out problems on our end, the ∅hr team conducted an investigation; here are the findings:

This submission prints debug info into STDOUT and doesn't pass the smoke tests.

It doesn't even pass the smoke test which we distribute in the sample inputs and outputs package.

Here's how the candidate's second submission fails the basic smoke test. The output on the right is the candidate's submission, and the output on the bottom left is the reference.

(Outputs included)

All in all, we can't say that this candidate should be seriously considered for a Rust developer position of any seniority.

Step 4: Social and Technical Interview Between the Highest-Ranking Candidates and the Customer🔗

After the top candidates were selected by ∅hr, a short interview with each was conducted by the customer's HR and an engineer. The purpose of these interviews was to determine whether the candidate was a good fit for the team and to ask some questions about their submissions or the exact stack used in the customer's company.

Here is what the CEO said about the interviews before and after testing:

Some of the candidates interviewed well, some didn't. We were a small team and we wouldn't be able to rank them objectively. Being able to look at people's results in a more standardized way caused us to not make an offer to a person who did well in the [initial] interview, but did poorly in the test.

Conversely, ∅hr helped us make an offer to someone who didn't do well during the [initial] interview, was shy and nervous, but they blew the test out of the water and ended up being an incredible contributor to our entire project.

I think about what would have happened if we made an offer to the first person rather than the second, and I think that we would've been very behind in our project. Because making a wrong hire is very expensive. Not making the right hire is also very expensive in the opportunity cost.


The recruitment process is multifaceted and cannot rely solely on traditional interview methods, especially for roles demanding specialized skills. ∅hr's standardized testing approach offered an objective measure, allowing the team to assess candidates' competencies beyond surface impressions. This objectivity proved invaluable, highlighting that an interviewee's comfort and eloquence during discussions do not necessarily equate to their on-the-job performance, and vice versa.

The experience underscored the significance of combining assessment tools to make informed hiring decisions. By prioritizing skill over surface impressions, the team was able to onboard a valuable contributor, avoiding potential setbacks and losses. This case serves as a reminder of the high costs associated with hiring missteps, both in direct expenses and in missed opportunities, emphasizing the importance of a holistic hiring approach.