Introduction:
|
Fundamentals:
|
ConvNets (Part 1):
|
ConvNets (Part 2):
|
Embeddings, Recurrent Neural Networks, and Sequences (Part 1):
|
Embeddings, Recurrent Neural Networks, and Sequences (Part 2):
|
Generative Models:
|
Reinforcement Learning:
- Slides
- Variations on the DQN ...
- https://github.com/mimoralea/gdrl/blob/master/notebooks/chapter_10/chapter-10.ipynb
- class FCDuelingQ(nn.Module): # fully connected dueling Q-network: separate state-value and action-advantage streams
- q = v + a - a.mean(1, keepdim=True).expand_as(a)
- class DuelingDDQN():
- argmax_a_q_sp = self.online_model(next_states).max(1)[1] # online model selects action
- q_sp = self.target_model(next_states).detach() # target model estimates action value
- mixed_weights = target_ratio + online_ratio
- class PrioritizedReplayBuffer():
- self.memory[idxs, self.td_error_index] = np.abs(td_errors)
- sorted_arg = self.memory[:self.n_entries, self.td_error_index].argsort()[::-1] # sorted by magnitude of TD error
- if self.rank_based:
- priorities = 1/(np.arange(self.n_entries) + 1)
- else: # proportional
- priorities = entries[:, self.td_error_index] + EPS
- scaled_priorities = priorities**self.alpha
- probs = np.array(scaled_priorities/np.sum(scaled_priorities), dtype=np.float64)
- weights = (self.n_entries * probs)**-self.beta
- normalized_weights = weights/weights.max()
- What role do the importance-sampling weights play in correcting the TD-error update?
- Vanilla Policy Gradient (aka REINFORCE with baseline) ...
- https://github.com/mimoralea/gdrl/blob/master/notebooks/chapter_11/chapter-11.ipynb
- class FCDAP(nn.Module): # fully connected discrete-action policy
- dist = torch.distributions.Categorical(logits=logits)
- action = dist.sample()
- class VPG():
- value_error = returns - self.values
- policy_loss = -(discounts * value_error.detach() * self.logpas).mean()
- entropy_loss = -self.entropies.mean()
- loss = policy_loss + self.entropy_loss_weight * entropy_loss # updates the policy model parameters
- value_loss = value_error.pow(2).mul(0.5).mean() # updates the state value model parameters
- How does the return (the discounted trajectory rewards) affect the policy's parameter updates?
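
On the importance-sampling question in the PrioritizedReplayBuffer notes above: prioritized replay draws transitions with a large TD error more often than uniform sampling would, which skews the expected update. The weights `(n_entries * probs)**-beta` scale each sampled transition back down to compensate. Below is a minimal NumPy sketch of the priority, probability, and weight steps listed above. The function name `per_probs_and_weights`, the `EPS` value, and the default `alpha` and `beta` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np

EPS = 1e-6  # assumed small constant so that no transition gets zero priority


def per_probs_and_weights(td_errors, alpha=0.6, beta=0.1, rank_based=False):
    """Return (sampling probabilities, normalized importance-sampling weights)."""
    abs_td = np.abs(np.asarray(td_errors, dtype=np.float64))
    n = len(abs_td)
    if rank_based:
        # priority = 1 / rank, where rank 1 is the largest |TD error|
        order = abs_td.argsort()[::-1]
        ranks = np.empty(n, dtype=np.int64)
        ranks[order] = np.arange(1, n + 1)
        priorities = 1.0 / ranks
    else:
        # proportional: priority = |TD error| + EPS
        priorities = abs_td + EPS
    scaled = priorities ** alpha            # alpha = 0 recovers uniform sampling
    probs = scaled / scaled.sum()
    weights = (n * probs) ** -beta          # beta = 1 fully corrects the sampling bias
    return probs, weights / weights.max()   # largest weight becomes 1


probs, weights = per_probs_and_weights([2.0, -0.5, 0.1, 1.0])
```

In a training loop, the normalized weights would typically multiply the sampled TD errors before the squared loss is taken, so transitions that are replayed often contribute smaller steps.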
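
On the return question in the VPG notes above: in the policy loss, each step's log-probability is multiplied by `discounts * value_error`, so the return reaches the policy parameters only through `returns - self.values`. The sketch below reproduces that factor in NumPy rather than PyTorch and writes out the softmax log-probability gradient by hand, so no autograd is needed. The function names, the two-action policy, and the example numbers are hypothetical.

```python
import numpy as np


def vpg_step_weights(rewards, values, gamma=0.99):
    """Return (discounts, returns, value_error) for one finished trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    discounts = gamma ** np.arange(T)
    # returns[t] = sum over k >= t of gamma^(k - t) * rewards[k]
    returns = np.array([np.sum(discounts[:T - t] * rewards[t:]) for t in range(T)])
    value_error = returns - values          # return minus the baseline prediction
    return discounts, returns, value_error


def softmax_logit_grad(logits, action):
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    logits = np.asarray(logits, dtype=np.float64)
    z = np.exp(logits - logits.max())
    probs = z / z.sum()
    grad = -probs
    grad[action] += 1.0                     # one_hot(action) - probs
    return grad


rewards = [1.0, 1.0, 1.0]
values = [2.5, 2.5, 0.5]                    # hypothetical baseline predictions
discounts, returns, value_error = vpg_step_weights(rewards, values, gamma=0.9)
# Ascent direction on the logits at step 0, assuming action 1 was taken
step0 = discounts[0] * value_error[0] * softmax_logit_grad([0.0, 0.0], action=1)
```

A positive `value_error` (the return beat the baseline) pushes the logit of the taken action up, a negative one pushes it down, and the discount shrinks the effect of later steps.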
|
Misc:
|
Homework Questions:
|